[Rdo-list] rdoproject.org infra outage post-mortem

Fabio M. Di Nitto fdinitto at redhat.com
Fri Mar 11 14:39:34 UTC 2016



On 3/11/2016 3:34 PM, Tristan Cacqueray wrote:
> On 03/11/2016 09:07 AM, Fabio M. Di Nitto wrote:
>>
>>
>> On 3/11/2016 8:01 AM, Alan Pevec wrote:
>>> 10.03.2016 20:54, Tristan Cacqueray wrote:
>>>> Hello folks,
>>>>
>>>> here is a little note about what happened... The underlying cloud
>>>> (called rcip-dev) experienced an outage related to rabbitmq cluster
>>>> inconsistency:
>>>> * No service was able to connect to rabbitmq on port 5672
>>>> * Restarting controllers didn't help until the rabbitmq cluster was
>>>>   killed and recreated.
>>>> * From that point, qrouter lost the VIP and no VRRP packets were sent.
>>>> * Restarting the controllers one by one after cleaning the rabbitmq
>>>>   cluster solved that issue.
>>>>
>>>> Timeline:
>>>> * 00:00 UTC: APIs started to time out
>>>> * 17:30 UTC: Network connections were lost
>>>> * 18:15 UTC: Network connectivity restored
>>>>
>>>> Service APIs and instances are now nominal.
>>>> Special kudos to Cédric Lecomte for fixing the cloud and saving the day!
>>>
>>> Thanks Tristan and Cedric!
>>> Dogfooding is important so we can improve what we ship.
>>> Do you have any logs that we could give to Fabio and his team to
>>> analyze what happened with those crazy wabbits, so we might get it
>>> fixed and avoid it in the future?
>>
>> Could you please escalate the issue in a bugzilla and assign it to Petr?
>>
> 
> Unfortunately there are no obvious steps to reproduce besides using
> nodepool (which is quite good at loading OpenStack services...).
> 

That is fine, we can hopefully survive without a reproducer.


> 
>> We need sosreports from the nodes involved in the incident, and there
>> could also be rabbitmq core dumps somewhere (but I can't remember the
>> location, though).
>>
> 
> I couldn't find any coredump on the controllers (using find / -xdev
> -iname "*core*" -type f), and I don't know the platform settings well
> enough to run sosreport right now.

Ok, can we find somebody to just run sosreport? Or alternatively provide
access to the environment?
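
For reference, something along these lines on each controller should be
enough (assuming the sos package is available; exact options can differ
between versions):

  # assuming RDO/CentOS controllers; adjust if the sos package name differs
  yum install -y sos
  sosreport --batch --all-logs

and then attach the resulting tarball (under /var/tmp or /tmp depending
on the sos version) to the bugzilla.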

> 
> 
> TL;DR: IMO the first real issue is that the rabbit cluster didn't check
> service availability and was unable to report connection errors to the
> services.
> 
> Then the second issue seems to be the shuffle strategy of
> oslo_messaging: in this case, even though only one controller had a
> rabbitmq issue, services seemed to always pick the same controller.
> 
> This really needs manual testing to better understand why services
> don't recover easily from such a failure and how to fix it.


We are aware of many of those issues, but most of them have been fixed
already. Starting from the sosreports we can figure out which packages
are installed on the systems, whether they need to be updated, and check
every config for every service to validate whether it is optimal, etc.

I understand you are trying to be helpful and I appreciate it, but those
logs are not enough.

Thanks
Fabio



