Speaker wanted: Florida/Georgia/Alabama area
by Rich Bowen
Do we have anyone on the list in the Florida/Georgia/Alabama area who
could give a talk about RDO to a meetup group?
Please contact me off-list. Thanks.
--
Rich Bowen - rbowen(a)redhat.com
@RDOcommunity // @CentOSProject // @rbowen
[ci] rdo-infra and tripleo-ci communication between teams
by Wesley Hayutin
Greetings RDO Infra teammates,
Let me begin by stating how much we appreciate the advanced level of
support that the RDO Infra team provides to its infrastructure users and
all the extra things done for lots of teams. However, from time to time
messages between the teams are lost or not fully communicated, which has
led to some hiccups.
For example, changes have been made to the openstack-nodepool tenant, the
TripleO-related scripts in review.rdoproject.org, and the general
infrastructure that have had a significant impact on CI. In some cases,
communication about these changes did not reach the entire TripleO CI team
in time.
In the spirit of continuous improvement, we are looking for ways to
streamline communication. Below are some ideas:
- Post about upcoming changes on a public mailing list (rdo-users?)
- Give any emails/posts related to outages or maintenance work on the
openstack-nodepool tenant a subject prefix such as [outage] or
[maintenance]; this will raise the visibility of the email in Gmail
inboxes
- Bring information about the change to an RDO CI Team meeting
- Avoid pinging a single TripleO CI team member on chat; email the whole
team instead
- Add RDO CI team members to reviews related to the openstack-nodepool
tenant, zuul/upstream.yaml, jobs/tripleo-upstream.yml, other TripleO CI
areas
- Perhaps a ticketing system?
Communication between teams within the production chain, in general, seems
to be mostly informal. As such, we will propose adding a session to the
upcoming "production chain sync" meeting early next year on the topic of
inter-team information sharing. We understand that good communication goes
both ways and would be open to hearing feedback and other suggestions.
Thanks, all.
[outage] Recurring Nodepool problems in review.rdoproject.org
by David Moreau Simard
Hi,
Last night at around 02:00 UTC, we had a recurrence of an issue with the
nodepool instance in review.rdoproject.org where it gets stuck and quite
literally stops doing anything until it is restarted.
This issue leaves jobs queued in Zuul with nothing running, since no nodes
are being provided by Nodepool.
The timing correlates with a ticket opened with RDO Cloud operations
regarding heavily degraded disk performance causing flapping alerts from
our monitoring.
Nodepool was restarted around 09:15 UTC, at which point jobs started
running again.
After rigorously troubleshooting what happened, we found and fixed several
issues which compounded each other:
#1 The clouds.yaml file for Nodepool did not specify an api-timeout
parameter
This parameter defaults to None [1][2], which could cause Shade to hang
for very long periods of time.
We've added the api-timeout parameter to our clouds.yaml file for each
Nodepool provider; this should help Nodepool (and Shade) recover when the
API is temporarily unresponsive.
We sent a patch to address this in Software Factory's implementation of
Nodepool. [3]
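For illustration, setting the timeout amounts to a clouds.yaml fragment
roughly like the one below; the cloud name, credentials, and the 60-second
value are hypothetical placeholders, not our actual configuration:

    clouds:
      rdo-cloud:                  # hypothetical provider name
        api_timeout: 60           # seconds; when unset this defaults to None,
                                  # so API calls can block indefinitely
        auth:                     # placeholder credentials
          auth_url: https://cloud.example.com:5000/v3
          username: nodepool
          password: secret
          project_name: openstack-nodepool

With a finite timeout, a hung API call fails quickly and Nodepool can retry
instead of blocking until someone restarts it.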
#2 The clouds.yaml file for Nodepool did not specify caching parameters
Caching parameters allow Shade and Nodepool to reduce the load on the API
servers by cutting down the number of calls, and we were not caching
anything.
We aligned our caching configuration with the settings currently used
upstream and submitted a patch to add support for it in Software Factory [4].
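For reference, caching is configured through a top-level cache section in
clouds.yaml, roughly as sketched below; the backend and TTL values are
illustrative rather than the exact settings we deployed:

    cache:
      class: dogpile.cache.memory   # any dogpile.cache backend
      expiration_time: 3600         # default TTL, in seconds, for cached API data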
#3 Shade queries the API to get a list of ports, then queries the API to
determine if each port is a floating IP
In what is currently considered a bug, described here [5], Shade ends up
making 7 API calls for a server with 6 ports (e.g. a baremetal OVB node):
- 1x to get the list of ports
- 6x to check whether each port is a floating IP
This makes some functions that Nodepool routinely uses, such as
list_servers(), take a considerable amount of time and puts a lot of strain
on the API server.
While troubleshooting this, we found that even though caching was enabled,
Shade wasn't using it; this was addressed in a patch written today [6].
We haven't quite figured out a fix to prevent this problematic behavior
when caching is disabled, but we have several ideas.
For the time being, we have worked around the issue by manually applying
this patch to our Shade installation and enabling caching in clouds.yaml.
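As a sketch of that workaround, per-resource expirations can be added under
the same cache section so that the port and floating IP lookups described
above are answered from cache; the 5-second values below are illustrative:

    cache:
      expiration:
        server: 5         # seconds; per-resource overrides
        port: 5           # caching port and floating IP lookups limits the
        floating-ip: 5    # repeated "is this port a floating IP?" calls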
We have good reason to believe that these three issues were either
responsible for, or at least largely contributed to, the recurring problem.
If it does happen again, we have a procedure to get Nodepool to dump
exactly what is going on.
[1]:
http://git.openstack.org/cgit/openstack/os-client-config/tree/os_client_c...
[2]:
http://git.openstack.org/cgit/openstack-infra/shade/tree/shade/openstackc...
[3]: https://softwarefactory-project.io/r/10542
[4]: https://softwarefactory-project.io/r/10541
[5]: https://storyboard.openstack.org/#!/story/2001394
[6]: https://review.openstack.org/#/c/526127/
David Moreau Simard
Senior Software Engineer | OpenStack RDO
dmsimard = [irc, github, twitter]
Fwd: Re: [openstack-community] Table at FOSDEM!
by Rich Bowen
Following up from the meeting this morning: If you're interested in
volunteering to work the OpenStack table at FOSDEM this year, the
etherpad is below.
-------- Forwarded Message --------
Subject: Re: [openstack-community] Table at FOSDEM!
Date: Wed, 6 Dec 2017 17:26:24 +0100
From: Adrien Cunin <adrien(a)adriencunin.fr>
To: community(a)lists.openstack.org
On 06/12/2017 at 16:40, Rich Bowen wrote:
> The time has arrived again. I see that we have a table at FOSDEM again -
> https://fosdem.org/2018/stands/ - and, as usual, I'm willing to work
> shifts. Who's running the organizing this year?
Hello,
Indeed we have :)
Here is the brand new etherpad for this year:
https://etherpad.openstack.org/p/fosdem-2018
Feel free to add your name!
Adrien
[Meeting] RDO meeting (2017-12-06) minutes
by Haïkel
==============================
#rdo: RDO meeting - 2017-12-06
==============================
Meeting started by number80 at 15:00:26 UTC. The full logs are
available at
http://eavesdrop.openstack.org/meetings/rdo_meeting___2017_12_06/2017/rdo...
.
Meeting summary
---------------
* roll call (number80, 15:00:38)
* building py3 services (number80, 15:05:55)
* test days (number80, 15:17:47)
* LINK: https://etherpad.openstack.org/p/RDO-Meeting (number80,
15:20:59)
* LINK: https://etherpad.openstack.org/p/rdo-queens-m2-cloud
(dmsimard, 15:21:01)
* LINK:
https://dmsimard.com/2017/11/29/come-try-a-real-openstack-queens-deployment/
(dmsimard, 15:21:11)
* trystack setup for test days is ongoing (number80, 15:24:47)
* ACTION: all spread the word about test days (number80, 15:24:59)
* ACTION: number80 add new items in "does it work" (number80,
15:28:38)
* Last chance to submit a talk for the CentOS Dojo at FOSDEM:
https://goo.gl/forms/FVOEtVOukuCGEnsG2 (number80, 15:30:17)
* LINK: https://goo.gl/forms/FVOEtVOukuCGEnsG2 (number80, 15:30:35)
* Both OpenStack and CentOS will have stands at FOSDEM this year
(number80, 15:34:19)
* we're looking for volunteers to man the OpenStack and CentOS booth
at FOSDEM (number80, 15:37:14)
* open floor (number80, 15:37:48)
* easyfix is looking for new projects ideas (number80, 15:38:26)
* LINK: https://github.com/redhat-openstack/easyfix/issues (number80,
15:38:44)
* ACTION: chandankumar to chair next week meeting (number80,
15:40:35)
Meeting ended at 15:41:59 UTC.
Action items, by person
-----------------------
* chandankumar
* chandankumar to chair next week meeting
* number80
* number80 add new items in "does it work"
People present (lines said)
---------------------------
* number80 (72)
* rbowen (27)
* dmsimard (13)
* tosky (8)
* chandankumar (8)
* jruzicka (7)
* openstack (6)
* ykarel (4)
* rdogerrit (2)
* mary_grace (2)
* myoung|ruck (1)
* Duck (1)
Generated by `MeetBot`_ 0.1.4
[outage] RDO Container Registry
by David Moreau Simard
Hi,
While we monitor the disk space utilization of trunk.registry.rdoproject.org,
the alerts for it had been silenced due to an ongoing false positive.
On November 28th, we pruned the metadata of ~5000 image tags [1] that were
more than 7 days old, after which we were supposed to prune the (now
orphaned) blobs.
The blobs were not deleted, and this led to the registry partition running
out of disk space.
Container pushes from approximately the last 48 hours may have failed as a
result of this issue.
We're currently pruning the orphaned blobs and pushes should work once
enough free space is available.
We'll address the false positive on the monitoring alert ASAP, and we hope
to automate the pruning process in the future to prevent this from
recurring.
Let me know if you have any questions,
[1]: https://paste.fedoraproject.org/paste/Js2rOwOFWdUqfRrBcZaXHw
David Moreau Simard
Senior Software Engineer | OpenStack RDO
dmsimard = [irc, github, twitter]
Re: [rdo-dev] [rhos-dev] [infra][outage] Nodepool outage on review.rdoproject.org, December 2
by Javier Pena
----- Original Message -----
> On Sat, Dec 02, 2017 at 01:57:08PM +0100, Alfredo Moralejo Alonso wrote:
> > On Sat, Dec 2, 2017 at 11:56 AM, Javier Pena <jpena(a)redhat.com> wrote:
> >
> > > Hi all,
> > >
> > > We had another nodepool outage this morning. Around 9:00 UTC, amoralej
> > > noticed that no new jobs were being processed. He restarted nodepool, and
> > > I
> > > helped him later with some stale node cleanup. Nodepool started creating
> > > VMs successfully around 10:00 UTC.
> > >
> > > On a first look at the logs, we see no new messages after 7:30 (not even
> > > DEBUG logs), but I was unable to run more troubleshooting steps because
> > > the
> > > service was already restarted.
> > >
> > >
> > In case it helps, I could successfully run both "nodepool list" and
> > "nodepool delete <id> --now" (for a couple of instances in delete status)
> > before restarting nodepool. However, nothing appeared in the logs and no
> > instances were created for the queued jobs, so I restarted
> > nodepool-launcher (my understanding was that this had fixed similar
> > situations in the past) before Javier started working on it.
> >
> >
> > > We will go through the logs on Monday to investigate what happened during
> > > the outage.
> > >
> > > Regards,
> > > Javier
> > >
> Please reach out to me the next time you restart it; something is seriously
> wrong if we have to keep restarting nodepool every few days. At this rate, I
> would even leave nodepool-launcher in the bad state until we inspect it.
>
Hi Paul,
This happened on a Saturday morning, so I did not expect you to be around. Had it been on a working day, of course I would have pinged you.
Leaving nodepool-launcher in a bad state for the whole weekend would mean that no jobs would be running at all, including promotion jobs. This is usually not acceptable, but I'll do it if everyone agrees it is OK to wait until Monday.
Regards,
Javier
> Thanks,
> PB
>
>