Re: [Rdo-list] rdoproject.org infra outage post-mortem

Friday, 11 March 2016

On 3/11/2016 8:01 AM, Alan Pevec wrote:
...
 10.03.2016 20:54, Tristan Cacqueray wrote:
> Hello folks,
>
> here is a little note about what happened... The underlying cloud
> (called rcip-dev) experienced an outage related to rabbitmq cluster
> inconsistency:
> * No service was able to connect to rabbitmq on port 5672.
> * Restarting the controllers didn't help until the rabbitmq cluster was
>   killed and recreated.
> * From that point, qrouter lost the VIP and no VRRP packets were sent.
> * Restarting the controllers one by one after cleaning the rabbitmq cluster
>   solved that issue.
>
> Timeline:
> * 00:00 UTC: API started to time out
> * 17:30 UTC: Network connections were lost
> * 18:15 UTC: Network connectivity restored
>
> Services' APIs and instances are now nominal.
> Special kudos to Cédric Lecomte for fixing the cloud and saving the day!

 Thanks Tristan and Cedric!
 Dogfooding is important so we can improve what we ship.
 Do you have any logs that we could give to Fabio and his team to
 analyze what happened with those crazy wabbits, so we might get it fixed
 and avoided in the future? 
Could you please escalate the issue in a Bugzilla and assign it to Petr?

We need sosreports from the nodes involved in the incident, and there
should potentially be rabbitmq core dumps (though I can't remember the
location).

Thanks
Fabio
