Re: [Rdo-list] rdoproject.org infra outage post-mortem

Friday, 11 March 2016

10.03.2016 20:54, Tristan Cacqueray wrote:
...
 Hello folks,

 here is a little note about what happened... The underlying cloud
 (called rcip-dev) experienced an outage related to a rabbitmq cluster
 inconsistency:
 * No service was able to connect to rabbitmq on port 5672.
 * Restarting the controllers didn't help until the rabbitmq cluster was
   killed and recreated.
 * From that point, qrouter lost the VIP and no VRRP packets were sent.
 * Restarting the controllers one by one after cleaning the rabbitmq
   cluster solved that issue.

 Timeline:
 * 00:00 UTC: API started to time out
 * 17:30 UTC: Network connectivity was lost
 * 18:15 UTC: Network connectivity restored

 Service APIs and instances are now nominal.
 Special kudos to Cédric Lecomte for fixing the cloud and saving the day!

Thanks Tristan and Cedric!
Dogfooding is important so we can improve what we ship.
Do you have any logs that we could give to Fabio and his team to
analyze what happened with those crazy wabbits, so we might get it fixed
and avoided in the future?

Thanks,
Alan
