[Rdo-list] rdoproject.org infra outage post-mortem

Fabio M. Di Nitto fdinitto at redhat.com
Fri Mar 11 09:07:58 UTC 2016



On 3/11/2016 8:01 AM, Alan Pevec wrote:
> 10.03.2016 20:54, Tristan Cacqueray wrote:
>> Hello folks,
>>
>> here is a little note about what happened... The underlying cloud
>> (called rcip-dev) experienced an outage related to rabbitmq cluster
>> inconsistency:
>> * No service was able to connect to rabbitmq on port 5672.
>> * Restarting the controllers didn't help until the rabbitmq cluster
>>   was killed and recreated.
>> * From that point, qrouter lost the VIP and no VRRP packets were sent.
>> * Restarting the controllers one by one after cleaning the rabbitmq
>>   cluster solved that issue.
>>
>> Timeline:
>> * 00:00 UTC: API started to time out
>> * 17:30 UTC: Network connectivity was lost
>> * 18:15 UTC: Network connectivity restored
>>
>> Service APIs and instances are now nominal.
>> Special kudos to Cédric Lecomte for fixing the cloud and saving the day!
> 
> Thanks Tristan and Cedric!
> Dogfooding is important so we can improve what we ship.
> Do you have any logs that we could give to Fabio and his team to
> analyze what happened with those crazy wabbits, so we might get it fixed
> and avoided in the future?

Could you please escalate the issue in a bugzilla and assign it to Petr?

We need sosreports from the nodes involved in the incident, and there
may also be rabbitmq core dumps (though I can't remember their
location).

Thanks
Fabio



