Hello folks,
here is a short note about what happened. The underlying cloud
(called rcip-dev) experienced an outage yesterday morning.
Anomalies observed:
* Mar 23 04:09:20 - compute - first AMQP error
* Mar 23 08:11:48 - controller - OOM killed nova-scheduler
* Mar 23 08:47:52 - controller - RabbitMQ service was unavailable
Possible root causes:
* PCS tried to restart the scheduler repeatedly and may have caused
pending AMQP connections to stall, potentially exhausting its
resources
* The Nova database wasn't purged, and since nova-scheduler caches the whole
instances table, it was overrun.
* nova-manage db archive_deleted_rows failed because of IntegrityError:
a foreign key constraint fails
* This resulted in a total of 70k instances and more than 120k
instance_actions_events being loaded into nova-scheduler each time.
-> Manually cleaning the nova database of all deleted instances reduced
nova-scheduler memory usage to 80KB (down from >3GB)
* Then once rabbitmq was down, services started to fail.
However, the timestamps don't exactly match the
trunk.rdoproject.org instance outage. Investigation is still ongoing.
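To gauge how many soft-deleted rows are still sitting in the database before and after a cleanup, a check along these lines might help. This is only a minimal sketch: it assumes Nova's soft-delete convention (a row with a non-zero `deleted` column counts as deleted), the helper names are hypothetical, and the in-memory SQLite tables are stand-ins for the real Nova database.

```python
import sqlite3

# Tables named in the note above; assumed to carry a `deleted` column.
TABLES = ("instances", "instance_actions_events")


def count_soft_deleted(conn, tables):
    """Hypothetical helper (not part of Nova): live vs. soft-deleted rows."""
    counts = {}
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        cur = conn.execute(
            "SELECT "
            "SUM(CASE WHEN deleted = 0 THEN 1 ELSE 0 END), "
            "SUM(CASE WHEN deleted != 0 THEN 1 ELSE 0 END) "
            f"FROM {table}"
        )
        live, deleted = cur.fetchone()
        # SUM() returns NULL on an empty table, so fall back to 0.
        counts[table] = {"live": live or 0, "deleted": deleted or 0}
    return counts


def demo_connection():
    """Build a toy in-memory database purely for illustration."""
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, deleted INTEGER)"
        )
    conn.executemany(
        "INSERT INTO instances (deleted) VALUES (?)",
        [(0,), (0,), (7,), (8,), (9,)],
    )
    conn.executemany(
        "INSERT INTO instance_actions_events (deleted) VALUES (?)",
        [(0,), (11,)],
    )
    return conn


if __name__ == "__main__":
    print(count_soft_deleted(demo_connection(), TABLES))
```

A large "deleted" count relative to "live" would be consistent with the purge not having run, but the numbers from the toy data here mean nothing by themselves.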
Timeline:
* Mar 23 07:40:00 - apevec noticed trunk.rdoproject.org lost connectivity
* Mar 23 09:00:00 - rabbitmq cluster rebuilt, controller node restarted
* Mar 23 09:50:00 - service restored
* Mar 23 10:13:00 - trunk.rdoproject.org restored after hard reboot
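To see where the timestamps line up and where they don't, the anomalies and the timeline can be merged into one chronological list. The sketch below is just an illustration using the times from this note (all on Mar 23, so only the time of day is compared); the function names are made up, and 07:40:00 is when the connectivity loss was noticed, not necessarily when it began.

```python
from datetime import datetime

# Events copied from this note; all fall on Mar 23.
EVENTS = [
    ("04:09:20", "compute: first AMQP error"),
    ("08:11:48", "controller: OOM killed nova-scheduler"),
    ("08:47:52", "controller: RabbitMQ service was unavailable"),
    ("07:40:00", "apevec noticed trunk.rdoproject.org lost connectivity"),
    ("09:00:00", "rabbitmq cluster rebuilt, controller node restarted"),
    ("09:50:00", "service restored"),
    ("10:13:00", "trunk.rdoproject.org restored after hard reboot"),
]


def parse(time_str):
    """Parse an HH:MM:SS time of day."""
    return datetime.strptime(time_str, "%H:%M:%S")


def merged_timeline(events):
    """Return the events sorted by time of day."""
    return sorted(events, key=lambda event: parse(event[0]))


def gap_seconds(earlier, later):
    """Seconds elapsed between two HH:MM:SS times on the same day."""
    return int((parse(later) - parse(earlier)).total_seconds())


if __name__ == "__main__":
    for time_str, description in merged_timeline(EVENTS):
        print(time_str, "-", description)
    # Connectivity loss was noticed before the scheduler was OOM killed.
    print("noticed -> OOM kill:", gap_seconds("07:40:00", "08:11:48"), "s")
```

In the merged view, the connectivity loss was noticed about half an hour before nova-scheduler was OOM killed, which is the mismatch mentioned above.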
Service APIs and instances are now nominal.
Regards,
-Tristan