<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Hi,</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Last night at around 02:00 UTC, we had a recurrence of an issue with the Nodepool instance in <a href="http://review.rdoproject.org">review.rdoproject.org</a> where it gets stuck and quite literally stops doing anything until it is restarted.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">This issue causes jobs to appear queued in Zuul with none running, since Nodepool is not providing any nodes.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">The timing correlates with a ticket opened with RDO Cloud operations regarding heavily degraded disk performance, which caused flapping alerts from our monitoring.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Nodepool was restarted around 09:15 UTC, at which point jobs started running again.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">After rigorously troubleshooting what happened, we found and fixed several issues that compounded each other:</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">#1 The clouds.yaml file for Nodepool did not specify an api-timeout parameter</div><div class="gmail_default" 
style="font-family:arial,helvetica,sans-serif"></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">This parameter defaults to None [1][2], which meant Shade could hang for very long periods of time.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">We've added the api-timeout parameter to our clouds.yaml file for each Nodepool provider; this should help Nodepool (and Shade) recover when the API is temporarily unresponsive.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">We sent a patch to address this in Software Factory's implementation of Nodepool [3].</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">#2 The clouds.yaml file for Nodepool did not specify caching parameters</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Caching parameters allow Shade and Nodepool to alleviate the load on the API servers by reducing the number of calls, and we were not caching anything.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">We aligned our caching configuration with the settings currently used upstream, with a change to provide support for it [4].</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">#3 Shade queries the API to get a list of ports, then queries the API to determine if each port is a floating IP</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">In what is currently considered a bug [5], Shade ends up making 7 API calls for a server with 6 ports (e.g. a baremetal OVB node):</div><div class="gmail_default" 
style="font-family:arial,helvetica,sans-serif">- 1x to get the list of ports</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">- 6x to check whether each port is a floating IP</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">This makes some functions Nodepool routinely uses, such as list_servers(), take a considerable amount of time and put a lot of strain on the API server.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">While troubleshooting this, we found that even though caching was enabled, Shade wasn't using it; this is addressed in a patch written today [6].</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">We haven't yet figured out how to prevent this behavior when caching is disabled, but we have several ideas.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">For the time being, we have worked around the issue by manually applying this patch to our Shade installation and enabling caching in clouds.yaml.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">We have good reason to believe that these three issues were either responsible for, or largely contributed to, the recurring problem.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">If it does happen again, we have a procedure to get Nodepool to dump exactly what is going on.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">[1]: <a 
href="http://git.openstack.org/cgit/openstack/os-client-config/tree/os_client_config/defaults.py?id=0d062b7d6fb29df035508117c15770c6846f7df4#n41">http://git.openstack.org/cgit/openstack/os-client-config/tree/os_client_config/defaults.py?id=0d062b7d6fb29df035508117c15770c6846f7df4#n41</a></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">[2]: <a href="http://git.openstack.org/cgit/openstack-infra/shade/tree/shade/openstackcloud.py?id=faaf5f024bd57162466c273dde370068ce55decf#n164">http://git.openstack.org/cgit/openstack-infra/shade/tree/shade/openstackcloud.py?id=faaf5f024bd57162466c273dde370068ce55decf#n164</a></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">[3]: <a href="https://softwarefactory-project.io/r/10542">https://softwarefactory-project.io/r/10542</a></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">[4]: <a href="https://softwarefactory-project.io/r/10541">https://softwarefactory-project.io/r/10541</a></div><div><div class="gmail_signature"><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">[5]: <a href="https://storyboard.openstack.org/#!/story/2001394">https://storyboard.openstack.org/#!/story/2001394</a></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">[6]: <a href="https://review.openstack.org/#/c/526127/">https://review.openstack.org/#/c/526127/</a></div></div><div class="gmail_signature"><br>David Moreau Simard<br>Senior Software Engineer | OpenStack RDO<br><br>dmsimard = [irc, github, twitter]</div></div>
</div>