Speaker wanted: Florida/Georgia/Alabama area
by Rich Bowen
Do we have anyone on the list in the Florida/Georgia/Alabama area who
could give a talk about RDO to a meetup group?
Please contact me off-list. Thanks.
--
Rich Bowen - rbowen(a)redhat.com
@RDOcommunity // @CentOSProject // @rbowen
[ci] rdo-infra and tripleo-ci communication between teams
by Wesley Hayutin
Greetings RDO Infra teammates,
Let me begin by stating how much we appreciate the advanced level of
support that the RDO Infra team provides to its infrastructure users and
all the extra things done for lots of teams. However, from time to time
messages between the teams are lost or not fully communicated, which has
led to some hiccups.
For example, changes have been made to the openstack-nodepool tenant, the
TripleO-related scripts in review.rdoproject.org, and the general
infrastructure that have had a significant impact on CI. In some cases,
communication about these changes did not reach the entire TripleO CI team
in time.
In the spirit of continuous improvement, we are looking for ways to
streamline communication. Below are some ideas:
- Post about upcoming changes on a public mailing list (rdo-users?)
- Give any emails/posts related to outages or maintenance work on the
openstack-nodepool tenant a subject prefix such as [outage] or
[maintenance]; this will raise the visibility of the email in Gmail
inboxes
- Bring information about the change to an RDO CI Team meeting
- Avoid pinging a single TripleO CI team member on chat; email the whole
team instead
- Add RDO CI team members to reviews related to the openstack-nodepool
tenant, zuul/upstream.yaml, jobs/tripleo-upstream.yml, other TripleO CI
areas
- Perhaps a ticketing system?
Communication between teams within the production chain, in general, seems
to be mostly informal. As such, we will propose adding a session to the
upcoming "production chain sync" meeting early next year on the topic of
inter-team information sharing. We understand that good communication goes
both ways and would be open to hearing feedback and other suggestions.
Thanks, all.
[outage] Recurring Nodepool problems in review.rdoproject.org
by David Moreau Simard
Hi,
Last night at around 02:00 UTC, we had a recurrence of an issue with the
nodepool instance in review.rdoproject.org where it gets stuck and quite
literally stops doing anything until it is restarted.
This issue leaves jobs queued in Zuul with nothing running, since no nodes
are being provided by Nodepool.
The timing correlates with a ticket opened with RDO Cloud operations
regarding heavily degraded disk performance causing flapping alerts from
our monitoring.
Nodepool was restarted around 09:15 UTC, at which point jobs started
running again.
After rigorously troubleshooting what happened, we found and fixed several
issues which compounded each other:
#1 The clouds.yaml file for Nodepool did not specify an api-timeout
parameter
This parameter defaults to None [1][2], which could cause Shade to hang
for very long periods of time.
We've added the api-timeout parameter to our clouds.yaml file for each
Nodepool provider; this should help Nodepool (and Shade) recover when the
API is temporarily unresponsive.
We sent a patch to address this in Software Factory's implementation of
Nodepool. [3]
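For illustration, setting the timeout amounts to a clouds.yaml fragment
roughly like the one below; the cloud name, credentials, and the 60-second
value are hypothetical placeholders, not our actual configuration:

    clouds:
      rdo-cloud:                  # hypothetical provider name
        api_timeout: 60           # seconds; when unset this defaults to None,
                                  # so API calls can block indefinitely
        auth:                     # placeholder credentials
          auth_url: https://cloud.example.com:5000/v3
          username: nodepool
          password: secret
          project_name: openstack-nodepool

With a finite timeout, a hung API call fails quickly and Nodepool can retry
instead of blocking until someone restarts it.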
#2 The clouds.yaml file for Nodepool did not specify caching parameters
Caching parameters allow Shade and Nodepool to reduce the load on the API
servers by cutting down the number of calls, and we were not caching
anything.
We aligned our caching configuration with the settings currently used
upstream and submitted a patch to add support for it in Software Factory [4].
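For reference, caching is configured through a top-level cache section in
clouds.yaml, roughly as sketched below; the backend and TTL values are
illustrative rather than the exact settings we deployed:

    cache:
      class: dogpile.cache.memory   # any dogpile.cache backend
      expiration_time: 3600         # default TTL, in seconds, for cached API data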
#3 Shade queries the API to get a list of ports, then queries the API to
determine if each port is a floating IP
In what is currently considered a bug, described here [5], Shade ends up
making 7 API calls for a server with 6 ports (e.g. a baremetal OVB node):
- 1x to get the list of ports
- 6x to check whether each port is a floating IP
This makes some functions that Nodepool routinely uses, such as
list_servers(), take a considerable amount of time and puts a lot of strain
on the API server.
While troubleshooting this, we found that even though caching was enabled,
Shade wasn't using it; this was addressed in a patch written today [6].
We haven't quite figured out a fix to prevent this problematic behavior
when caching is disabled, but we have several ideas.
For the time being, we have worked around the issue by manually applying
this patch to our Shade installation and enabling caching in clouds.yaml.
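As a sketch of that workaround, per-resource expirations can be added under
the same cache section so that the port and floating IP lookups described
above are answered from cache; the 5-second values below are illustrative:

    cache:
      expiration:
        server: 5         # seconds; per-resource overrides
        port: 5           # caching port and floating IP lookups limits the
        floating-ip: 5    # repeated "is this port a floating IP?" calls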
We have good reason to believe that these three issues were either
responsible for, or at least largely contributed to, the recurring problem.
If it does happen again, we have a procedure to get Nodepool to dump
exactly what is going on.
[1]:
http://git.openstack.org/cgit/openstack/os-client-config/tree/os_client_c...
[2]:
http://git.openstack.org/cgit/openstack-infra/shade/tree/shade/openstackc...
[3]: https://softwarefactory-project.io/r/10542
[4]: https://softwarefactory-project.io/r/10541
[5]: https://storyboard.openstack.org/#!/story/2001394
[6]: https://review.openstack.org/#/c/526127/
David Moreau Simard
Senior Software Engineer | OpenStack RDO
dmsimard = [irc, github, twitter]
Fwd: Re: [openstack-community] Table at FOSDEM!
by Rich Bowen
Following up from the meeting this morning: If you're interested in
volunteering to work the OpenStack table at FOSDEM this year, the
etherpad is below.
-------- Forwarded Message --------
Subject: Re: [openstack-community] Table at FOSDEM!
Date: Wed, 6 Dec 2017 17:26:24 +0100
From: Adrien Cunin <adrien(a)adriencunin.fr>
To: community(a)lists.openstack.org
On 06/12/2017 at 16:40, Rich Bowen wrote:
> The time has arrived again. I see that we have a table at FOSDEM again -
> https://fosdem.org/2018/stands/ - and, as usual, I'm willing to work
> shifts. Who's running the organizing this year?
Hello,
Indeed we have :)
Here is the brand new etherpad for this year:
https://etherpad.openstack.org/p/fosdem-2018
Feel free to add your name!
Adrien
[Meeting] RDO meeting (2017-12-06) minutes
by Haïkel
==============================
#rdo: RDO meeting - 2017-12-06
==============================
Meeting started by number80 at 15:00:26 UTC. The full logs are
available at
http://eavesdrop.openstack.org/meetings/rdo_meeting___2017_12_06/2017/rdo...
.
Meeting summary
---------------
* roll call (number80, 15:00:38)
* building py3 services (number80, 15:05:55)
* test days (number80, 15:17:47)
* LINK: https://etherpad.openstack.org/p/RDO-Meeting (number80,
15:20:59)
* LINK: https://etherpad.openstack.org/p/rdo-queens-m2-cloud
(dmsimard, 15:21:01)
* LINK:
https://dmsimard.com/2017/11/29/come-try-a-real-openstack-queens-deployment/
(dmsimard, 15:21:11)
* trystack setup for test days is ongoing (number80, 15:24:47)
* ACTION: all spread the word about test days (number80, 15:24:59)
* ACTION: number80 add new items in "does it work" (number80,
15:28:38)
* Last chance to submit a talk for the CentOS Dojo at FOSDEM:
https://goo.gl/forms/FVOEtVOukuCGEnsG2 (number80, 15:30:17)
* LINK: https://goo.gl/forms/FVOEtVOukuCGEnsG2 (number80, 15:30:35)
* Both OpenStack and CentOS will have stands at FOSDEM this year
(number80, 15:34:19)
* we're looking for volunteers to man the OpenStack and CentOS booth
at FOSDEM (number80, 15:37:14)
* open floor (number80, 15:37:48)
* easyfix is looking for new projects ideas (number80, 15:38:26)
* LINK: https://github.com/redhat-openstack/easyfix/issues (number80,
15:38:44)
* ACTION: chandankumar to chair next week meeting (number80,
15:40:35)
Meeting ended at 15:41:59 UTC.
Action items, by person
-----------------------
* chandankumar
* chandankumar to chair next week meeting
* number80
* number80 add new items in "does it work"
People present (lines said)
---------------------------
* number80 (72)
* rbowen (27)
* dmsimard (13)
* tosky (8)
* chandankumar (8)
* jruzicka (7)
* openstack (6)
* ykarel (4)
* rdogerrit (2)
* mary_grace (2)
* myoung|ruck (1)
* Duck (1)
Generated by `MeetBot`_ 0.1.4
[outage] RDO Container Registry
by David Moreau Simard
Hi,
While we monitor the disk space utilization of trunk.registry.rdoproject.org,
the alerts for it had been silenced due to an ongoing false positive.
On November 28th, we pruned the metadata of ~5000 image tags [1] that were
more than 7 days old, after which we were supposed to prune the (now
orphaned) blobs.
The blobs were not deleted, and this led to the registry partition running
out of disk space.
Container pushes from approximately the last 48 hours may have failed as a
result of this issue.
We're currently pruning the orphaned blobs and pushes should work once
enough free space is available.
We'll address the false positive on the monitoring alert ASAP, and we hope
to automate the pruning process in the future to prevent this from
recurring.
Let me know if you have any questions,
[1]: https://paste.fedoraproject.org/paste/Js2rOwOFWdUqfRrBcZaXHw
David Moreau Simard
Senior Software Engineer | OpenStack RDO
dmsimard = [irc, github, twitter]
Re: [rdo-dev] [rhos-dev] [infra][outage] Nodepool outage on review.rdoproject.org, December 2
by Javier Pena
----- Original Message -----
> On Sat, Dec 02, 2017 at 01:57:08PM +0100, Alfredo Moralejo Alonso wrote:
> > On Sat, Dec 2, 2017 at 11:56 AM, Javier Pena <jpena(a)redhat.com> wrote:
> >
> > > Hi all,
> > >
> > > We had another nodepool outage this morning. Around 9:00 UTC, amoralej
> > > noticed that no new jobs were being processed. He restarted nodepool, and
> > > I
> > > helped him later with some stale node cleanup. Nodepool started creating
> > > VMs successfully around 10:00 UTC.
> > >
> > > On a first look at the logs, we see no new messages after 7:30 (not even
> > > DEBUG logs), but I was unable to run more troubleshooting steps because
> > > the
> > > service was already restarted.
> > >
> > >
> > In case it helps, I could successfully run both "nodepool list" and
> > "nodepool delete <id> --now" (for a couple of instances in delete status)
> > before restarting nodepool. However, nothing appeared in the logs and no
> > instances were created for the queued jobs, so I restarted
> > nodepool-launcher (my understanding was that this had fixed similar
> > situations in the past) before Javier started working on it.
> >
> >
> > > We will go through the logs on Monday to investigate what happened during
> > > the outage.
> > >
> > > Regards,
> > > Javier
> > >
> Please reach out to me the next time you restart it; something is seriously
> wrong if we have to keep restarting nodepool every few days. At this rate, I
> would even leave nodepool-launcher in the bad state until we inspect it.
>
Hi Paul,
This happened on a Saturday morning, so I did not expect you to be around. Had it been on a working day, of course I would have pinged you.
Leaving nodepool-launcher in a bad state for the whole weekend would mean that no jobs would be running at all, including promotion jobs. This is usually not acceptable, but I'll do it if everyone agrees it is OK to wait until Monday.
Regards,
Javier
> Thanks,
> PB
>
>