[Rdo-list] rdoproject.org infra outage post-mortem #2
Tristan Cacqueray
tdecacqu at redhat.com
Thu Mar 24 20:41:04 UTC 2016
Hello folks,
here is a little note about what happened... The underlying cloud
(called rcip-dev) experienced an outage yesterday morning.
Anomalies observed:
* Mar 23 04:09:20 - compute - first AMQP error
* Mar 23 08:11:48 - controller - OOM killed nova-scheduler
* Mar 23 08:47:52 - controller - RabbitMQ service was unavailable
Possible root causes:
* PCS tried to restart the scheduler repeatedly and may have caused
pending AMQP connections to stall, potentially exhausting its
resources
* The Nova database wasn't purged, and since nova-scheduler caches the
whole instances table, it was overrun.
* nova-manage db archive_deleted_rows failed because of IntegrityError:
a foreign key constraint fails
* This resulted in a total of 70k instances and more than 120k
instance_actions_events being loaded into nova-scheduler each time.
-> Manually cleaning the nova database of all deleted instances reduced
nova-scheduler memory to 80KB (down from >3GB)
* Then once rabbitmq was down, services started to fail.
However, the timestamps don't exactly match the
trunk.rdoproject.org instance outage. Investigation is still ongoing.
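
For illustration only, here is a minimal sketch of why purging deleted
instances can hit a foreign key IntegrityError, and how deleting dependent
rows first avoids it. This is not the cleanup that was actually run on
rcip-dev. It uses an in-memory SQLite database with a heavily simplified,
hypothetical schema: the instance_actions table, the column names and the
purge_deleted_instances / make_demo_db helpers are all assumptions made
for the example.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the nova tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE instances (
            uuid TEXT PRIMARY KEY,
            deleted INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE instance_actions (
            id INTEGER PRIMARY KEY,
            instance_uuid TEXT NOT NULL REFERENCES instances(uuid)
        );
        CREATE TABLE instance_actions_events (
            id INTEGER PRIMARY KEY,
            action_id INTEGER NOT NULL REFERENCES instance_actions(id)
        );
        INSERT INTO instances VALUES ('a', 1), ('b', 1), ('c', 0);
        INSERT INTO instance_actions VALUES (1, 'a'), (2, 'c');
        INSERT INTO instance_actions_events VALUES (10, 1), (11, 2);
        """
    )
    return conn


def purge_deleted_instances(conn):
    """Remove soft-deleted instances, deleting child rows before parents.

    Returns the number of rows removed from the instances table.
    """
    cur = conn.cursor()
    # Events reference actions, and actions reference instances, so the
    # deletion order is events -> actions -> instances.
    cur.execute(
        "DELETE FROM instance_actions_events WHERE action_id IN ("
        " SELECT a.id FROM instance_actions a"
        " JOIN instances i ON a.instance_uuid = i.uuid"
        " WHERE i.deleted != 0)"
    )
    cur.execute(
        "DELETE FROM instance_actions WHERE instance_uuid IN ("
        " SELECT uuid FROM instances WHERE deleted != 0)"
    )
    cur.execute("DELETE FROM instances WHERE deleted != 0")
    purged = cur.rowcount
    conn.commit()
    return purged


if __name__ == "__main__":
    conn = make_demo_db()
    try:
        # Deleting the parent rows first violates the foreign key.
        conn.execute("DELETE FROM instances WHERE deleted != 0")
    except sqlite3.IntegrityError as exc:
        print("naive purge failed:", exc)
        conn.rollback()
    print("instances purged:", purge_deleted_instances(conn))
```

The ordering is the whole point of the sketch: removing the rows that
reference an instance before the instance itself keeps every statement
consistent with the constraints. On a real deployment, take a database
backup first and check the actual schema rather than relying on this
simplified one.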
Timeline:
Mar 23 07:40:00 - apevec noticed trunk.rdoproject.org lost connectivity
Mar 23 09:00:00 - rabbitmq cluster rebuilt, controller node restarted
Mar 23 09:50:00 - service restored
Mar 23 10:13:00 - trunk.rdoproject.org restored after hard reboot
Service APIs and instances are now nominal.
Regards,
-Tristan