[rdo-dev] [rhos-dev] [infra][outage] Nodepool outage on review.rdoproject.org, December 2
Javier Pena
jpena at redhat.com
Mon Dec 11 10:40:41 UTC 2017
----- Original Message -----
> On December 3, 2017 9:27 pm, Paul Belanger wrote:
> [snip]
> > Please reach out to me the next time you restart it, something is seriously
> > wrong if we have to keep restarting nodepool every few days.
> > At this rate, I would even leave nodepool-launcher in the bad state until
> > we inspect it.
> >
> > Thanks,
> > PB
> >
>
> Hello,
>
> nodepoold was stuck again. Before restarting it I dumped the threads'
> stack traces, and it seems like 8 threads were trying to acquire a single
> lock (futex=0xe41de0):
> https://review.rdoproject.org/paste/show/9VnzowfzBogKG4Gw0Kes/
>
> This leaves the main loop stuck at
> http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/nodepool.py#n1281
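For the next occurrence, something like the sketch below could be wired in at
service start-up so the per-thread stacks can be dumped on demand without
attaching gdb. This is only a rough example using the standard faulthandler
module and an arbitrarily chosen signal, not anything nodepool provides out of
the box:

    # Rough sketch, not part of nodepool: pre-register a signal handler so a
    # stuck nodepoold can dump every thread's Python stack on demand.
    # Assumes Python 3.3+, or the faulthandler backport on Python 2.
    import faulthandler
    import signal
    import sys

    # "kill -USR2 <pid>" will then write all thread stacks to stderr (point
    # 'file' at a log file handle instead if stderr is not captured).
    faulthandler.register(signal.SIGUSR2, file=sys.stderr, all_threads=True)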
>
> I'm not entirely sure what caused this deadlock; the other threads involved
> are quite complex:
> * kazoo zk_loop
> * zmq received
> * apscheduler mainloop
> * periodicCheck paramiko client connect
> * paramiko transport run
> * nodepool webapp handle request
>
> Next time, before restarting the process, it would be good to know what
> thread is actually holding the lock, using (gdb) py-print, as explained
> here:
> https://stackoverflow.com/questions/42169768/debug-pythread-acquire-lock-deadlock/42256864#42256864
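To match the lock-level gdb output against the named threads listed above, the
thread idents can also be correlated from inside the process. A small sketch
along those lines (dump_named_stacks is just an illustrative name, not an
existing nodepool helper):

    # Illustrative helper, not an existing nodepool API: print each live
    # thread's ident, name and current Python stack so the gdb/futex trace
    # can be matched against the named threads listed above.
    import sys
    import threading
    import traceback

    def dump_named_stacks(out=sys.stderr):
        names = {t.ident: t.name for t in threading.enumerate()}
        for ident, frame in sys._current_frames().items():
            out.write("Thread %s (%s):\n" % (ident, names.get(ident, "unknown")))
            traceback.print_stack(frame, file=out)
            out.write("\n")

Something like this could be hooked behind the same kind of debug signal as in
the earlier sketch, or run from a debug console while the process is stuck.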
>
> Paul: any other debug instructions would be appreciated.
>
Hello,
As a follow-up: the Zuul queue for rdoinfo, DLRN-rpmbuild and other jobs using the rdo-centos-7/rdo-centos-7-ssd nodes was moving very slowly. After checking, I found multiple nodes that nodepool reported as "ready" but that did not exist in Jenkins. For example:
+-------+-----------+------+--------------+-----+-------+-------------+---------+
| ID    | Provider  | AZ   | Label        | ... | State | Age         | Comment |
+-------+-----------+------+--------------+-----+-------+-------------+---------+
| 62045 | rdo-cloud | None | rdo-centos-7 | ... | ready | 01:10:24:24 | None    |
| 62047 | rdo-cloud | None | rdo-centos-7 | ... | ready | 01:10:24:19 | None    |
+-------+-----------+------+--------------+-----+-------+-------------+---------+
The queue only moved when there were more pending requests than nodes in this state, since that is the only situation where nodepool tries to build new nodes. I have manually removed the stale nodes to allow the reviews to move on.
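For reference, this kind of drift could be spotted automatically by cross-checking
nodepool's "ready" nodes against what Jenkins actually knows about. A rough,
untested sketch along those lines, assuming the python-jenkins library; the URL,
credentials and column indexes below are placeholders to adapt:

    # Rough sketch, not a tested tool: list nodes that nodepool reports as
    # "ready" but that Jenkins does not know about, so they can be reviewed
    # and removed by hand.
    import subprocess

    import jenkins  # pip install python-jenkins

    def nodepool_ready_nodes():
        """Parse 'nodepool list' output; the format varies between versions."""
        out = subprocess.check_output(['nodepool', 'list']).decode('utf-8')
        nodes = []
        for line in out.splitlines():
            cols = [c.strip() for c in line.strip('|').split('|')]
            if len(cols) > 10 and cols[0].isdigit():
                nodes.append(cols)
        return nodes

    def main():
        # Placeholder URL and credentials.
        server = jenkins.Jenkins('https://jenkins.example.com',
                                 username='admin', password='secret')
        jenkins_names = {n['name'] for n in server.get_nodes()}
        for cols in nodepool_ready_nodes():
            # Column indexes are guesses; adjust to the actual
            # 'nodepool list' column order shown above.
            node_id, hostname, state = cols[0], cols[5], cols[10]
            if state == 'ready' and hostname not in jenkins_names:
                print('Node %s (%s) is ready in nodepool but missing from '
                      'Jenkins' % (node_id, hostname))

    if __name__ == '__main__':
        main()

Nodes flagged this way could then be reviewed and deleted (e.g. with
"nodepool delete <id>").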
This is already documented in the etherpad at https://review.rdoproject.org/etherpad/p/nodepool-infra-debugging.
Regards,
Javier
> Regards,
> -Tristan
>