Re: [rdo-dev] [rhos-dev] [infra][outage] Nodepool outage on review.rdoproject.org, December 2

Monday, 11 December 2017

----- Original Message -----
...
 On December 3, 2017 9:27 pm, Paul Belanger wrote:
 [snip]
 > Please reach out to me the next time you restart it, something is seriously
 > wrong if we have to keep restarting nodepool every few days.
 > At this rate, I would even leave nodepool-launcher in the bad state until
 > we inspect it.
 > 
 > Thanks,
 > PB
 > 

 Hello,

 nodepoold was stuck again. Before restarting it, I dumped the threads'
 stack traces, and it seems that 8 threads were trying to acquire a single
 lock (futex=0xe41de0):
 https://review.rdoproject.org/paste/show/9VnzowfzBogKG4Gw0Kes/

 This leaves the main loop stuck at
 http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/node...

 I'm not entirely sure what caused this deadlock; the other threads involved
 are quite complex:
 * kazoo zk_loop
 * zmq received
 * apscheduler mainloop
 * periodicCheck paramiko client connect
 * paramiko transport run
 * nodepool webapp handle request

 Next time, before restarting the process, it would be good to know which
 thread is actually holding the lock, using (gdb) py-print, as explained
 here:

https://stackoverflow.com/questions/42169768/debug-pythread-acquire-lock-...
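For illustration only, here is a minimal sketch of what "several threads waiting on one lock" looks like in a stack dump. It is not nodepool code: the names dump_thread_stacks, demo and waiter are hypothetical, and it relies on sys._current_frames(), which only works from inside the running process. For a process that is already stuck, attaching gdb as described above is still the way to go.

```python
import sys
import threading
import time
import traceback


def dump_thread_stacks():
    """Return {thread name: StackSummary} for every live thread in this process."""
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, "unknown-%d" % ident)
        stacks[name] = traceback.extract_stack(frame)
    return stacks


def demo():
    """Hypothetical reproduction: three threads block on a lock held by the caller."""
    lock = threading.Lock()
    lock.acquire()  # the caller plays the role of the unknown lock owner

    def waiter():
        with lock:  # blocks here until the owner releases the lock
            pass

    threads = [
        threading.Thread(target=waiter, name="waiter-%d" % i) for i in range(3)
    ]
    for t in threads:
        t.start()
    time.sleep(0.2)  # give the waiters time to reach the acquire call

    stacks = dump_thread_stacks()
    # A thread whose innermost Python frame is waiter() is sitting in acquire.
    blocked = sorted(
        name for name, stack in stacks.items() if stack and stack[-1].name == "waiter"
    )

    lock.release()
    for t in threads:
        t.join()
    return blocked


if __name__ == "__main__":
    print(demo())
```

Note that the dump shows which threads are waiting, not which thread owns the lock; that is the piece of information the gdb session is meant to recover.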

 Paul: any other debug instructions would be appreciated.

Hello,

As a follow-up: the Zuul queue for rdoinfo, DLRN-rpmbuild and other jobs using the
rdo-centos-7/rdo-centos-7-ssd nodes was moving very slowly. After checking, there were
multiple nodes seen by nodepool as "ready", but those nodes were not in Jenkins.
For example:

+-------+-----------+------+--------------+-----+-------+-------------+---------+
| ID    | Provider  | AZ   | Label        | ... | State | Age         | Comment |
+-------+-----------+------+--------------+-----+-------+-------------+---------+
| 62045 | rdo-cloud | None | rdo-centos-7 | ... | ready | 01:10:24:24 | None    |
| 62047 | rdo-cloud | None | rdo-centos-7 | ... | ready | 01:10:24:19 | None    |
+-------+-----------+------+--------------+-----+-------+-------------+---------+

The queue was only moving when there were more pending requests than nodes in this state,
since that is when nodepool tries to build new nodes. I have manually removed them to
allow the reviews to move on.
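
The check described above (nodes that nodepool reports as "ready" but that Jenkins does not know about) could be sketched roughly as follows. This is only an assumption-laden illustration: parse_nodepool_table and stale_ready_nodes are hypothetical helpers, the input is the pipe-delimited table text pasted from the node listing, and the set of node IDs known to Jenkins is assumed to have been collected separately.

```python
def parse_nodepool_table(text):
    """Parse a pipe-delimited table into a list of {column: value} dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines such as +-----+-----+
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header row
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def stale_ready_nodes(rows, jenkins_node_ids):
    """Return IDs of nodes in state 'ready' that are absent from Jenkins.

    jenkins_node_ids is a hypothetical set of node IDs attached to Jenkins.
    """
    return [
        row["ID"]
        for row in rows
        if row.get("State") == "ready" and row["ID"] not in jenkins_node_ids
    ]


if __name__ == "__main__":
    sample = """
    +-------+-----------+-------+
    | ID    | Provider  | State |
    +-------+-----------+-------+
    | 62045 | rdo-cloud | ready |
    | 62047 | rdo-cloud | ready |
    +-------+-----------+-------+
    """
    # Pretend Jenkins only knows about node 62047.
    print(stale_ready_nodes(parse_nodepool_table(sample), {"62047"}))
```

The IDs returned are candidates for manual removal, which is what was done by hand here.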

This is already documented in the etherpad at
https://review.rdoproject.org/etherpad/p/nodepool-infra-debugging.

Regards,
Javier

...
 Regards,
 -Tristan
