[Rdo-list] rdoproject.org infra outage post-mortem

Tristan Cacqueray tdecacqu at redhat.com
Thu Mar 10 19:54:26 UTC 2016


Hello folks,

here is a short note about what happened. The underlying cloud
(called rcip-dev) experienced an outage related to a rabbitmq cluster
inconsistency:
* No service was able to connect to rabbitmq on port 5672.
* Restarting the controllers didn't help until the rabbitmq cluster was
  killed and recreated.
* From that point, qrouter lost the VIP and no VRRP packets were sent.
* Restarting the controllers one by one after cleaning the rabbitmq
  cluster solved that issue.
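
For reference, the first symptom above (nothing could connect to rabbitmq
on port 5672) can be checked with a plain TCP probe. This is only a minimal
sketch, not what was actually run during the outage: the function name
can_connect and the timeout value are assumptions for illustration, and a
successful TCP connection only shows the port is reachable, not that the
rabbitmq cluster itself is consistent.

```python
import socket


def can_connect(host: str, port: int = 5672, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it checks reachability of the AMQP port only and
    does not speak AMQP or inspect the cluster state.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling can_connect("<controller address>") against each controller in turn
would show which ones accept connections on 5672.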

Timeline:
* 00:00 UTC: APIs started to time out
* 17:30 UTC: Network connectivity was lost
* 18:15 UTC: Network connectivity restored

Service APIs and instances are now nominal.
Special kudos to Cédric Lecomte for fixing the cloud and saving the day!

Regards,
-Tristan
