[Rdo-list] rdoproject.org infra outage post-mortem

Tristan Cacqueray tdecacqu at redhat.com
Fri Mar 11 14:34:08 UTC 2016


On 03/11/2016 09:07 AM, Fabio M. Di Nitto wrote:
> 
> 
> On 3/11/2016 8:01 AM, Alan Pevec wrote:
>> 10.03.2016 20:54, Tristan Cacqueray wrote:
>>> Hello folks,
>>>
>>> here is a little note about what happened... The underlying cloud
>>> (called rcip-dev) experienced an outage related to a rabbitmq cluster
>>> inconsistency:
>>> * No service was able to connect to rabbitmq on port 5672.
>>> * Restarting controllers didn't help until the rabbitmq cluster was
>>>   killed and recreated.
>>> * From that point, qrouter lost the VIP and no VRRP packets were sent.
>>> * Restarting the controllers one by one after cleaning the rabbitmq
>>>   cluster solved that issue.
>>>
>>> Timeline:
>>> * 00:00 UTC: APIs started to time out
>>> * 17:30 UTC: Network connections were lost
>>> * 18:15 UTC: Network connectivity restored
>>>
>>> Service APIs and instances are now nominal.
>>> Special kudos to Cédric Lecomte for fixing the cloud and saving the day!
>>
>> Thanks Tristan and Cedric!
>> Dogfooding is important so we can improve what we ship.
>> Do you have any logs that we could give to Fabio and his team to
>> analyze what happened with those crazy wabbits, so we might get it fixed
>> and avoid it in the future?
> 
> If you could please escalate the issue in a bugzilla and assign it to Petr.
> 

Unfortunately there are no obvious steps to reproduce besides using
nodepool (which is quite good at putting load on OpenStack services...).

However, here are some more technical insights:

* The first tell was the lack of AMQP queue consumers:
# rabbitmqctl list_queues messages consumers name | tail
168 0 reply_478513620c5f4be4a60e294b02dabe54
444 0 scheduler


In the logs, this translates to:
conductor.log: Timed out waiting for a reply to message ID
scheduler.log: The exchange Exchange reply_XXX(direct) to send to
reply_XXX doesn't exist yet, retrying...
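
A quick way to spot that condition (just a sketch; the awk column
indexes match the list_queues arguments used above) is to filter for
queues that have a backlog but zero consumers:

# rabbitmqctl list_queues messages consumers name | awk '$1 > 0 && $2 == 0'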


* The second tell was that one of the rabbitmq services was not
available: the admin port was working fine and cluster_status didn't
report any error, but direct connections to port 5672 were unreachable:

ERROR oslo_messaging._drivers.impl_rabbit [req-XXX - - - - -] AMQP
server on controler_ip:5672 is unreachable: [Errno 32] Broken pipe.
Trying again in 1 seconds.
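
To catch that discrepancy (cluster_status green but the AMQP port dead),
something like the following probe from another node can help; the
controller hostnames below are placeholders:

# for h in controller1 controller2 controller3; do
    timeout 2 bash -c "</dev/tcp/$h/5672" && echo "$h ok" || echo "$h unreachable"
  done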


* Then, after the rabbitmq services were restored, the controllers needed
to be power-cycled to restore operation.
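
For reference, the usual softer sequence to recreate a broken cluster
member (a sketch only, not necessarily what was done here; the peer node
name is a placeholder) looks like:

# rabbitmqctl stop_app
# rabbitmqctl force_reset
# rabbitmqctl join_cluster rabbit@controller1
# rabbitmqctl start_app
# rabbitmqctl cluster_status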


> We need sosreports from the nodes involved in the accident and there
> should potentially be rabbitmq core dumps (but I can't remember the
> location though).
> 

I couldn't find any core dump on the controllers (using find / -xdev
-iname "*core*" -type f), and I don't know the platform settings well
enough to run sosreports now.


TL;DR: IMO the first real issue is that the rabbitmq cluster didn't check
service availability and was unable to report the connection error to the
services.
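
A minimal availability check of that kind (a sketch; it only verifies the
TCP listener, not a full AMQP handshake) could run from cron on each
controller:

# timeout 5 bash -c '</dev/tcp/127.0.0.1/5672' || \
    logger -p daemon.err "rabbitmq port 5672 not accepting connections"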

The second issue seems to be the shuffle strategy of oslo_messaging: in
this case, even though only one controller had a rabbitmq issue, services
seemed to always choose that same controller.
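
If that's confirmed, one knob worth experimenting with is oslo.messaging's
kombu_failover_strategy (shuffle vs round-robin). A sketch for nova,
assuming the stock RDO config paths and service names:

# crudini --set /etc/nova/nova.conf oslo_messaging_rabbit \
    kombu_failover_strategy round-robin
# systemctl restart openstack-nova-scheduler openstack-nova-conductor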

This really needs manual testing to better understand why services don't
recover easily from such a failure and how to fix it.
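
One way to reproduce a similar failure mode by hand (again a sketch, not
what happened here) is to blackhole the AMQP port on a single controller
and watch how the services fail over:

# iptables -I INPUT -p tcp --dport 5672 -j DROP
... observe client behaviour, then remove the rule:
# iptables -D INPUT -p tcp --dport 5672 -j DROP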


Regards,
-Tristan
