[rdo-dev] [outage] Recurring Nodepool problems in review.rdoproject.org

David Moreau Simard dms at redhat.com
Wed Dec 6 20:03:15 UTC 2017


Hi,

Last night at around 02:00 UTC, we had a recurrence of an issue with the
nodepool instance in review.rdoproject.org where it gets stuck and quite
literally stops doing anything until it is restarted.
When this happens, jobs appear queued in Zuul but never run, since
Nodepool is not providing any nodes.

The timing correlates with a ticket opened with the RDO Cloud operations
team regarding heavily degraded disk performance, which was causing
flapping alerts from our monitoring.

Nodepool was restarted around 09:15 UTC, at which point jobs started
running again.

After rigorously troubleshooting what happened, we found and fixed several
issues which compounded each other:

#1 The clouds.yaml file for Nodepool did not specify an api-timeout
parameter
This parameter defaults to None [1][2], which could cause Shade to hang
for very long periods of time.
We've added the api-timeout parameter to our clouds.yaml file for each
Nodepool provider; this should help Nodepool (and Shade) recover when the
API is temporarily unresponsive.
We sent a patch to address this in Software Factory's implementation of
Nodepool. [3]
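
As a rough sketch of what this looks like (the cloud name, auth details
and timeout value below are illustrative, not our actual configuration),
the change amounts to setting api_timeout on each cloud entry in
clouds.yaml:

    clouds:
      rdo-cloud:                        # illustrative provider name
        auth:
          auth_url: https://openstack.example.com:5000/v3
          username: nodepool
          password: secret
          project_name: nodepool
        region_name: regionOne
        api_timeout: 60                 # seconds; defaults to None (no timeout)

With a finite timeout, a hung API call eventually fails with an error that
Nodepool can handle and retry, instead of blocking indefinitely.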

#2 The clouds.yaml file for Nodepool did not specify caching parameters
Caching allows Shade and Nodepool to alleviate the load on the API servers
by reducing the number of calls they make, and we were not caching
anything.
We aligned our caching configuration with the same settings currently used
upstream and submitted a patch to provide support for it [4].
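
For reference, the caching section of clouds.yaml looks roughly like this
(the values below are illustrative, not necessarily the exact upstream
settings we adopted):

    cache:
      expiration_time: 3600             # default TTL for cached data, in seconds
      expiration:                       # per-resource overrides, in seconds
        server: 5
        port: 5
        floating-ip: 5

Even a few seconds of caching on frequently polled resources such as
servers and ports significantly reduces the number of API calls Nodepool
generates.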

#3 Shade queries the API to get a list of ports, then queries the API to
determine if each port is a floating IP
In what is currently considered a bug, described here [5], Shade will end
up doing 7 API calls for a server with 6 ports (e.g. a baremetal OVB node):
- 1x to get the list of ports
- 6x to check whether each port is a floating IP

This makes some functions Nodepool routinely uses, such as list_servers(),
take a considerable amount of time and put a lot of strain on the API
server.
While troubleshooting this, we found that even though caching was enabled,
Shade wasn't using it; this was addressed in a patch written today [6].
We haven't quite figured out how to prevent this problematic behavior when
caching is disabled, but we have several ideas.

For the time being, we have worked around the issue by manually applying
this patch to our shade installation and enabling caching in clouds.yaml.

We have good reason to believe that these three issues were either
responsible for, or largely contributed to, the recurring problem.
If it does happen again, we have a procedure to get Nodepool to dump
exactly what is going on.

[1]:
http://git.openstack.org/cgit/openstack/os-client-config/tree/os_client_config/defaults.py?id=0d062b7d6fb29df035508117c15770c6846f7df4#n41
[2]:
http://git.openstack.org/cgit/openstack-infra/shade/tree/shade/openstackcloud.py?id=faaf5f024bd57162466c273dde370068ce55decf#n164
[3]: https://softwarefactory-project.io/r/10542
[4]: https://softwarefactory-project.io/r/10541
[5]: https://storyboard.openstack.org/#!/story/2001394
[6]: https://review.openstack.org/#/c/526127/

David Moreau Simard
Senior Software Engineer | OpenStack RDO

dmsimard = [irc, github, twitter]