10.03.2016 20:54, Tristan Cacqueray wrote:
Hello folks,
here is a little note about what happened... The underlying cloud
(called rcip-dev) experienced an outage related to rabbitmq cluster
inconsitency:
* No service was not able to connect to rabbitmq on port 5672
* Restarting controllers didn't helped until the rabbitmq cluster was
killed and recreated.
* From that point, qrouter lost the VIP and no VRRP packets was sent.
* Restarting the controller one by one after cleaning rabbitmq cluster
solved that issue.
Timeline:
* 00:00 UTC: API started to timeout
* 17:30 UTC: Network connections was lost
* 18:15 UTC: Network connectivity restored
Services' API and instances are now nominals.
Special kudos to Cédric Lecomte for fixing the cloud and saving the day!
Thanks Tristan and Cedric!
Dogfooding is important so we can improve what we ship.
Do you do have any logs that we could give to Fabio and his team to
analyze what happened with those crazy wabbits, so we might get it fixed
and avoided in the future?
Thanks,
Alan