[Rdo-list] rdoproject.org infra outage post-mortem

Alan Pevec alan.pevec at redhat.com
Fri Mar 11 07:01:01 UTC 2016


10.03.2016 20:54, Tristan Cacqueray wrote:
> Hello folks,
> 
> here is a little note about what happened... The underlying cloud
> (called rcip-dev) experienced an outage related to rabbitmq cluster
> inconsistency:
> * No service was able to connect to rabbitmq on port 5672.
> * Restarting the controllers didn't help until the rabbitmq cluster was
>   killed and recreated.
> * From that point, qrouter lost the VIP and no VRRP packets were sent.
> * Restarting the controllers one by one after cleaning the rabbitmq
>   cluster solved that issue.
> 
> Timeline:
> * 00:00 UTC: APIs started to time out
> * 17:30 UTC: Network connectivity was lost
> * 18:15 UTC: Network connectivity restored
> 
> Services' APIs and instances are now nominal.
> Special kudos to Cédric Lecomte for fixing the cloud and saving the day!

Thanks Tristan and Cédric!
Dogfooding is important so we can improve what we ship.
Do you have any logs that we could give to Fabio and his team to
analyze what happened with those crazy wabbits, so we might get it fixed
and avoided in the future?

Thanks,
Alan



