[Rdo-list] rdoproject.org infra outage post-mortem #2

Thursday, 24 March 2016

Hello folks,

Here is a short note about what happened. The underlying cloud
(called rcip-dev) experienced an outage yesterday morning.

Anomalies observed:

* Mar 23 04:09:20 - compute - first AMQP error
* Mar 23 08:11:48 - controller - OOM killed nova-scheduler
* Mar 23 08:47:52 - controller - RabbitMQ service was unavailable

Possible root causes:

* PCS tried to restart the scheduler repeatedly and may have caused
  pending AMQP connections to stall, potentially exhausting its
  resources

* The Nova database wasn't purged, and since nova-scheduler caches the
  whole instances table, it was overrun.
* nova-manage db archive_deleted_rows failed because of an
  IntegrityError: a foreign key constraint fails
* This resulted in a total of 70k instances and more than 120k
  instance_actions_events being loaded into nova-scheduler each time.
-> Manually cleaning the nova database of all deleted instances reduced
   nova-scheduler memory to 80KB (down from >3GB)

* Then, once RabbitMQ was down, services started to fail.

However, the timestamps don't exactly match the
trunk.rdoproject.org instance outage. The investigation is still ongoing.

Timeline:
 Mar 23 07:40:00 - apevec noticed trunk.rdoproject.org lost connectivity
 Mar 23 09:00:00 - rabbitmq cluster rebuilt, controller node restarted
 Mar 23 09:50:00 - service restored
 Mar 23 10:13:00 - trunk.rdoproject.org restored after hard reboot
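
To make the timestamp mismatch easier to see, the anomalies and the
timeline can be merged into one ordered list. The following is only a
minimal sketch: the events are copied from this note, the year 2016 is
assumed from the mail date, and the helper names (parse, merged) are
made up for illustration.

```python
from datetime import datetime

# Events copied verbatim from this note (no other log source is used).
ANOMALIES = [
    ("Mar 23 04:09:20", "compute - first AMQP error"),
    ("Mar 23 08:11:48", "controller - OOM killed nova-scheduler"),
    ("Mar 23 08:47:52", "controller - RabbitMQ service was unavailable"),
]
TIMELINE = [
    ("Mar 23 07:40:00", "apevec noticed trunk.rdoproject.org lost connectivity"),
    ("Mar 23 09:00:00", "rabbitmq cluster rebuilt, controller node restarted"),
    ("Mar 23 09:50:00", "service restored"),
    ("Mar 23 10:13:00", "trunk.rdoproject.org restored after hard reboot"),
]


def parse(stamp, year=2016):
    """Parse a syslog-style stamp; the year is an assumption (mail date)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def merged(*sources):
    """Return all (datetime, description) pairs sorted by time."""
    return sorted((parse(stamp), text) for src in sources for stamp, text in src)


if __name__ == "__main__":
    for when, text in merged(ANOMALIES, TIMELINE):
        print(when.strftime("%b %d %H:%M:%S"), "-", text)
```

In the merged order, the loss of connectivity on trunk.rdoproject.org
(07:40:00) comes before the OOM kill of nova-scheduler (08:11:48), which
is the mismatch mentioned above.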

Service APIs and instances are now nominal.

Regards,
-Tristan
