[Rdo-list] rdoproject.org infra outage post-mortem #2

Tristan Cacqueray tdecacqu at redhat.com
Thu Mar 24 20:41:04 UTC 2016


Hello folks,

Here is a short note about what happened. The underlying cloud
(called rcip-dev) experienced an outage yesterday morning.

Anomalies observed:

* Mar 23 04:09:20 - compute - first AMQP error
* Mar 23 08:11:48 - controller - OOM killed nova-scheduler
* Mar 23 08:47:52 - controller - RabbitMQ service was unavailable


Possible root causes:

* PCS tried to restart the scheduler repeatedly and may have caused
  pending AMQP connections to stall, potentially exhausting its
  resources

* The Nova database wasn't purged, and since nova-scheduler caches the
  whole instances table, it was overrun.
* nova-manage db archive_deleted_rows failed because of IntegrityError:
  a foreign key constraint fails
* This resulted in a total of 70k instances and more than 120k
  instance_actions_events being loaded in nova-scheduler each time.
-> Manually cleaning the nova database of all deleted instances reduced
   nova-scheduler memory to 80KB (down from >3GB)

* Then, once RabbitMQ was down, services started to fail.
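
To illustrate the foreign key failure mentioned above, here is a minimal
sketch of why deleting soft-deleted instances directly can raise an
IntegrityError, and how removing dependent rows first avoids it. It uses an
in-memory SQLite database with a heavily simplified, assumed schema (three
tables and a "deleted" flag); it is not the real Nova schema, and not
necessarily the cleanup that was run on rcip-dev. The function names
build_demo_db and purge_soft_deleted are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE instances (
            uuid TEXT PRIMARY KEY,
            deleted INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE instance_actions (
            id INTEGER PRIMARY KEY,
            instance_uuid TEXT NOT NULL REFERENCES instances(uuid)
        );
        CREATE TABLE instance_actions_events (
            id INTEGER PRIMARY KEY,
            action_id INTEGER NOT NULL REFERENCES instance_actions(id)
        );
        INSERT INTO instances VALUES ('a', 0), ('b', 1), ('c', 1);
        INSERT INTO instance_actions VALUES (1, 'a'), (2, 'b');
        INSERT INTO instance_actions_events VALUES (1, 1), (2, 2), (3, 2);
        """
    )
    return conn


def purge_soft_deleted(conn):
    """Delete soft-deleted instances, children first; return instances removed."""
    cur = conn.cursor()
    # 1. Events that belong to actions of soft-deleted instances.
    cur.execute(
        "DELETE FROM instance_actions_events WHERE action_id IN ("
        " SELECT a.id FROM instance_actions a"
        " JOIN instances i ON a.instance_uuid = i.uuid"
        " WHERE i.deleted != 0)"
    )
    # 2. Actions of soft-deleted instances.
    cur.execute(
        "DELETE FROM instance_actions WHERE instance_uuid IN ("
        " SELECT uuid FROM instances WHERE deleted != 0)"
    )
    # 3. The soft-deleted instances themselves.
    cur.execute("DELETE FROM instances WHERE deleted != 0")
    purged = cur.rowcount
    conn.commit()
    return purged


if __name__ == "__main__":
    conn = build_demo_db()
    try:
        # Naive approach: parent rows first, while children still reference them.
        conn.execute("DELETE FROM instances WHERE deleted != 0")
    except sqlite3.IntegrityError as exc:
        print("naive delete failed:", exc)
        conn.rollback()
    print("instances purged:", purge_soft_deleted(conn))
```

On a real deployment the table relationships are more numerous, so any such
cleanup would need to be checked against the actual schema and run against a
backup first.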


However, the timestamps don't exactly match the
trunk.rdoproject.org instance outage. Investigation is still ongoing.

Timeline:
 Mar 23 07:40:00 - apevec noticed trunk.rdoproject.org lost connectivity
 Mar 23 09:00:00 - rabbitmq cluster rebuilt, controller node restarted
 Mar 23 09:50:00 - service restored
 Mar 23 10:13:00 - trunk.rdoproject.org restored after hard reboot

Service APIs and instances are now nominal.

Regards,
-Tristan
