[rdo-dev] [rhos-dev] [infra][outage] Nodepool outage on review.rdoproject.org, December 2

Tristan Cacqueray tdecacqu at redhat.com
Sat Dec 2 15:04:49 UTC 2017


On December 2, 2017 12:57 pm, Alfredo Moralejo Alonso wrote:
> On Sat, Dec 2, 2017 at 11:56 AM, Javier Pena <jpena at redhat.com> wrote:
> 
>> Hi all,
>>
>> We had another nodepool outage this morning. Around 9:00 UTC, amoralej
>> noticed that no new jobs were being processed. He restarted nodepool, and I
>> helped him later with some stale node cleanup. Nodepool started creating
>> VMs successfully around 10:00 UTC.
>>
>> On a first look at the logs, we see no new messages after 7:30 (not even
>> DEBUG logs), but I was unable to run more troubleshooting steps because the
>> service was already restarted.
>>
That's odd, though the root logger was still at the WARNING log level.
I've bumped it to DEBUG too, so hopefully we'll get more logs
from gear, shade and paramiko.
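
For reference, the effect of that change can be sketched in Python roughly
as follows. This is only an illustration, not the actual nodepool logging
configuration: the function name is made up, and the logger names are
assumed to match what gear, shade and paramiko register.

```python
import logging

# Library loggers named above. These names are an assumption; child
# loggers such as "gear.Client" inherit from them.
LIBRARY_LOGGERS = ("gear", "shade", "paramiko")


def enable_debug_logging():
    """Lower the root logger to DEBUG so library loggers inherit it."""
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    for name in LIBRARY_LOGGERS:
        # NOTSET makes a logger defer to its ancestors (here, the root).
        logging.getLogger(name).setLevel(logging.NOTSET)
    return root


if __name__ == "__main__":
    logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s: %(message)s")
    enable_debug_logging()
    logging.getLogger("paramiko").debug("visible once the root is at DEBUG")
```

With the root logger at WARNING, DEBUG and INFO records from those
libraries are dropped before they reach any handler.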

>>
> In case it helps, I was able to run both "nodepool list" and
> "nodepool delete <id> --now" (for a couple of instances in delete status)
> successfully before restarting nodepool.
>
IIRC, the nodepool command talks directly to the database and the provider,
and it runs as a separate process from the nodepool-launcher service.

> However, nothing appeared in the logs and no
> instances were created for jobs in the queue, so I restarted nodepool-launcher
> (my understanding was that it fixed similar situations in the past) before
> Javier started working on it.
> 
FTR, the service is called nodepool-launcher, but in nodepoolv2
terminology it's actually the nodepoold daemon.

> 
>> We will go through the logs on Monday to investigate what happened during
>> the outage.
>>
It seems the service stopped at 07:30; it should have been dumping
AllocationRequest debug messages every 10 seconds. It's unclear what the
service was doing. Perhaps next time we could try dumping its stack trace
like so (gdb and the Python debuginfo are now installed):

gdb --batch --eval-command="thread apply all bt" --eval-command=bt -p \
  $(ps ax | grep nodepoold | awk '/python2/ { print $1 }') > /root/nodepoold.debug
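
As a complement to the gdb approach, a Python service can also be made to
dump its own thread stacks on a signal. The sketch below is hypothetical
and is not something nodepool is known to do here: it assumes a Python 3
interpreter (for the standard faulthandler module), a Unix host, and that
the service code could be changed. The function name and the choice of
SIGUSR1 are illustrative.

```python
import faulthandler
import signal


def install_stack_dump_handler(path, signum=signal.SIGUSR1):
    """Append every thread's Python stack to `path` when `signum` arrives."""
    # The file must stay open for as long as the handler is registered,
    # so it is returned and the caller has to keep a reference to it.
    dump_file = open(path, "a")
    faulthandler.register(signum, file=dump_file, all_threads=True)
    return dump_file
```

Once installed, sending the chosen signal to the process (for example with
kill -USR1 <pid>) appends the tracebacks to the file without stopping the
process.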

Regards,
-Tristan

>> Regards,
>> Javier
>>
> 