[Rdo-list] rdoproject.org infra outage post-mortem

Thursday, 10 March 2016

Hello folks,

here is a short note about what happened. The underlying cloud
(called rcip-dev) experienced an outage related to a rabbitmq cluster
inconsistency:
* No service was able to connect to rabbitmq on port 5672.
* Restarting the controllers didn't help until the rabbitmq cluster was
  killed and recreated.
* From that point, qrouter lost the VIP and no VRRP packets were sent.
* Restarting the controllers one by one after cleaning the rabbitmq
  cluster solved that issue.
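
For anyone who wants to check the first symptom above by hand, here is a
minimal sketch of a TCP reachability probe for the rabbitmq port. It is
not the tooling we used during the outage; the function name and the
host name in the usage comment are hypothetical, and it only tests that
the port accepts a connection, not that the cluster is healthy.

```python
import socket


def amqp_port_reachable(host, port=5672, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration: it only checks that something
    is listening on the port, not that the AMQP handshake works.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


# Usage (host name is an assumption, replace with a real controller):
# amqp_port_reachable("controller-1.example.org")
```

A probe like this would have returned False from every service host
while the cluster was in its inconsistent state.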

Timeline:
* 00:00 UTC: API started to time out
* 17:30 UTC: Network connectivity was lost
* 18:15 UTC: Network connectivity restored

Service APIs and instances are now nominal.
Special kudos to Cédric Lecomte for fixing the cloud and saving the day!

Regards,
-Tristan
