Hi,

Last night, at around 02:00 UTC, we had a recurrence of an issue with the Nodepool instance on review.rdoproject.org where it gets stuck and quite literally stops doing anything until it is restarted.
When this happens, jobs appear queued in Zuul but never start, since Nodepool is not providing any nodes.

This timing correlates with a ticket opened with the RDO Cloud operations team regarding heavily degraded disk performance, which was causing flapping alerts from our monitoring.

Nodepool was restarted at around 09:15 UTC, at which point jobs started running again.

After rigorously troubleshooting what happened, we found and fixed several issues which compounded each other:

#1 The clouds.yaml file for Nodepool did not specify an api-timeout parameter
This parameter defaults to None [1][2], which could cause Shade to hang for very long periods of time.
We have added the api-timeout parameter to our clouds.yaml file for each Nodepool provider; this should help Nodepool (and Shade) recover when the API is temporarily unresponsive.
We sent a patch to address this in Software Factory's implementation of Nodepool. [3]
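For reference, the end result in clouds.yaml looks roughly like the sketch below; the cloud name, auth details and timeout value are illustrative, not our exact configuration:

clouds:
  rdo-cloud:                                    # illustrative cloud name
    auth:
      auth_url: https://example.org:5000/v2.0   # placeholder endpoint
      username: nodepool
      password: secret
      project_name: nodepool
    # Bound every API call so a hung request cannot block Shade indefinitely;
    # 60 seconds is an example value, not a recommendation.
    api_timeout: 60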

#2 The clouds.yaml file for Nodepool did not specify caching parameters
Caching allows Shade and Nodepool to alleviate the load on the API servers by reducing the number of calls, and we were not caching anything.
We aligned our caching configuration with the settings currently used upstream and sent a patch to add support for it in Software Factory's implementation of Nodepool [4].
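The cache section sits at the top level of clouds.yaml, alongside the clouds section shown above. A minimal sketch, with illustrative values rather than our exact settings:

cache:
  # Per-resource expirations (in seconds) for Shade's internal list caches;
  # the numbers here are examples, not necessarily what we deployed.
  expiration:
    server: 5
    port: 5
    floating-ip: 5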

#3 Shade queries the API to get a list of ports, then queries the API to determine if each port is a floating IP
In what is currently considered a bug [5], Shade ends up doing 7 API calls for a server with 6 ports (e.g. a baremetal OVB node):
- 1x to get the list of ports
- 6x to check if any of the ports are floating IPs

This makes some functions that Nodepool routinely uses, such as list_servers(), take a considerable amount of time and puts a lot of strain on the API server.
While troubleshooting this, we found that even though caching was enabled, Shade was not using it; this was addressed in a patch written today [6].
We have not yet figured out a fix to prevent this problematic behavior when caching is disabled, but we have several ideas.

For the time being, we have worked around the issue by manually applying this patch [6] to our Shade installation and enabling caching in clouds.yaml.

We have good reason to believe that these three issues were either responsible for, or largely contributed to, the recurring problem.
If it happens again, we have a procedure to get Nodepool to dump exactly what is going on.

[1]: http://git.openstack.org/cgit/openstack/os-client-config/tree/os_client_config/defaults.py?id=0d062b7d6fb29df035508117c15770c6846f7df4#n41
[2]: http://git.openstack.org/cgit/openstack-infra/shade/tree/shade/openstackcloud.py?id=faaf5f024bd57162466c273dde370068ce55decf#n164
[3]: https://softwarefactory-project.io/r/10542
[4]: https://softwarefactory-project.io/r/10541

David Moreau Simard
Senior Software Engineer | OpenStack RDO

dmsimard = [irc, github, twitter]