[rdo-dev] [rhos-dev] [infra][outage] Nodepool outage on review.rdoproject.org, December 2

Tristan Cacqueray tdecacqu at redhat.com
Mon Dec 11 08:49:03 UTC 2017

On December 3, 2017 9:27 pm, Paul Belanger wrote:
> Please reach out to me the next time you restart it, something is seriously
> wrong if we have to keep restarting nodepool every few days.
> At this rate, I would even leave nodepool-launcher in the bad state until we inspect it.
> Thanks,
> PB


nodepoold was stuck again. Before restarting it I dumped the threads' stack traces, and
it seems that 8 threads were trying to acquire a single lock (futex=0xe41de0):

This leaves the main loop stuck at

I'm not entirely sure what caused this deadlock; the other threads involved
are quite complex:
* kazoo zk_loop
* zmq received
* apscheduler mainloop
* periodicCheck paramiko client connect
* paramiko transport run
* nodepool webapp handle request

Next time, before restarting the process, it would be good to find out which
thread is actually holding the lock, using (gdb) py-print, as explained

Paul: any other debug instructions would be appreciated.
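
As a possible complement to the gdb approach, here is a minimal sketch of an
in-process dump of every thread's Python stack. It is not something nodepool is
known to ship; the function name format_thread_stacks is hypothetical, and it
relies only on the standard library (sys._current_frames, threading,
traceback). It shows where each thread is blocked, but not which thread owns
the lock, so gdb is still needed for that part.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text dump of the current Python stack of every thread.

    Hypothetical helper for illustration; not part of nodepool.
    """
    # Map thread identifiers to names so the dump is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):" % (ident, names.get(ident, "unknown")))
        # format_stack returns one string per frame, oldest call first.
        lines.extend(entry.rstrip("\n") for entry in traceback.format_stack(frame))
        lines.append("")
    return "\n".join(lines)
```

One way to use it would be to write the result to the log from a signal
handler or a debug endpoint. Depending on the Python version, a signal handler
may not run while the main thread itself is blocked on a lock, so this is only
useful when some thread is still able to call it.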
