[Rdo-list] rdopkg overview

whayutin whayutin at redhat.com
Mon Feb 2 16:15:56 UTC 2015


On Fri, 2015-01-30 at 14:52 +0100, Jakub Ruzicka wrote:
> Very nice overview Steve, thanks for writing this down!
> 
> My random thoughts on the matter inline.
> 
> On 29.1.2015 22:48, Steve Linabery wrote:
> > I have been struggling with the amount of information to convey and what level of detail to include. Since I can't seem to get it perfect to my own satisfaction, here is the imperfect (and long, sorry) version to begin discussion.
> > 
> > This is an overview of where things stand (rdopkg CI 'v0.1').
> 
> For some time I've been wondering whether we should really call it
> rdopkg CI, since it's not really tied to rdopkg but to RDO. You can
> use most of rdopkg on any distgit. I reckon we should simply call it
> RDO CI to avoid confusion. I for one don't underestimate the impact
> of naming stuff ;)
> 
> > Terminology:
> > 'Release' refers to an OpenStack release (e.g. havana,icehouse,juno)
> > 'Dist' refers to a distro supported by RDO (e.g. fedora-20, epel-6, epel-7)
> > 'phase1' is the initial smoketest for an update submitted via `rdopkg update`
> > 'phase2' is a full-provision test for accumulated updates that have passed phase1
> > 'snapshot' means an OpenStack snapshot of a running instance, i.e. a disk image created from a running OS instance.
> > 
> > The very broad strokes:
> > -----------------------
> > 
> > rdopkg ci is triggered when a packager uses `rdopkg update`.
> > 
> > When a review lands in the rdo-update gerrit project, a 'phase1' smoketest is initiated via jenkins for each Release/Dist combination present in the update (e.g. if the update contains builds for icehouse/fedora-20 and icehouse/epel-6, each set of RPMs from each build will be smoketested on an instance running the associated Release/Dist). If *all* supported builds from the update pass phase1, then the update is merged into rdo-update. Updates that pass phase1 accumulate in the updates/ directory in the rdo-update project.
> > 
> > Periodically, a packager may run 'phase2'. This takes everything in updates/ and uses those RPMs + RDO production repo to provision a set of base images with packstack aio. Again, a simple tempest test is run against the packstack aio instances. If all pass, then phase2 passes, and the `rdopkg update` yaml files are moved from updates/ to ready/.
> > 
> > At that point, someone with the keys to the stage repos will push the builds in ready/ to the stage repo. If CI against stage repo passes, stage is rsynced to production.
> > 
> > Complexity, Part 1:
> > -------------------
> > 
> > Rdopkg CI v0.1 was designed around the use of OpenStack VM disk snapshots. On a periodic basis, we provision two nodes for each supported combination in [Releases] X [Dists] (e.g. "icehouse, fedora-20" "juno, epel-7" etc). One node is a packstack aio instance built against RDO production repos, and the other is a node running tempest. After a simple tempest test passes for all the packstack aio nodes, we would snapshot the set of instances. Then when we want to do a 'phase1' test for e.g. "icehouse, fedora-20", we can spin up the instances previously snapshotted and save the time of re-running packstack aio.
> > 
> > Using snapshots saves approximately 30 minutes of wait time per test run by skipping provisioning, but it imposes a few substantial costs/complexities. First and most significant, snapshots need to be reinstantiated using the same IP addresses that were present when packstack and tempest were run during provisioning. This means we must serialize phase1 runs; otherwise an instance might fail to provision because its 'static' IP address is already in use by another run. The second cost is that in practice, a) our OpenStack infrastructure has been unreliable, and b) not all Release/Dist combinations provision reliably. So it becomes hard to create a full set of snapshots.
> > 
> > Additionally, some updates (e.g. when an update comes in for openstack-puppet-modules) prevent the use of a previously-provisioned packstack instance. Continuing with the o-p-m example: that package is used for provisioning. So simply updating the RPM for that package after running packstack aio doesn't tell us anything about the package sanity (other than perhaps if a new, unsatisfied RPM dependency was introduced).
> > 
> > Another source of complexity comes from the nature of the rdopkg update 'unit'. Each yaml file created by `rdopkg update` can contain multiple builds for different Release,Dist combinations. So there must be a way to 'collate' the results of each smoketest for each Release,Dist and pass phase1 only if all updates pass. Furthermore, some combinations of Release,Dist are known (at times, for various ad hoc reasons) to fail testing, and those combinations sometimes need to be 'disabled'. For example, if we know that icehouse/f20 is 'red' on a given day, we might want an update containing icehouse/fedora-20,icehouse/epel-6 to test only the icehouse/epel-6 combination and pass if that passes.
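The collation Steve describes could be sketched roughly like this (the `DISABLED` set and the shape of the build list are my guesses for illustration, not the actual rdopkg internals):

```python
# Sketch of phase1 collation: pass only if every *enabled*
# Release/Dist combination in the update passes its smoketest.
# DISABLED and the build-list layout are hypothetical.

# Combinations known to be 'red' today, temporarily disabled.
DISABLED = {("icehouse", "fedora-20")}

def phase1_result(update_builds, smoketest):
    """update_builds: list of (release, dist) tuples from the YAML.
    smoketest: callable returning True/False for one combination."""
    enabled = [b for b in update_builds if b not in DISABLED]
    if not enabled:
        return False  # nothing left to test; don't merge blindly
    return all(smoketest(release, dist) for release, dist in enabled)
```

So with icehouse/fedora-20 disabled, an update containing icehouse/fedora-20 and icehouse/epel-6 passes phase1 exactly when the epel-6 smoketest passes.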
> > 
> > Finally, pursuant to the previous point, there need to be 'control' structure jobs for provision/snapshot, phase1, and phase2 runs that pass (and perform some action upon passing) only when all their 'child' jobs have passed.
> > 
> > The way we have managed this complexity to date is through the use of the jenkins BuildFlow plugin. Here's some ASCII art (courtesy of 'tree') to show how the jobs are structured now (these are descriptive job names, not the actual jenkins job names). BuildFlow jobs are indicated by (bf).
> > 
> > .
> > `-- rdopkg_master_flow (bf)
> >     |-- provision_and_snapshot (bf)
> >     |   |-- provision_and_snapshot_icehouse_epel6
> >     |   |-- provision_and_snapshot_icehouse_f20
> >     |   |-- provision_and_snapshot_juno_epel7
> >     |   `-- provision_and_snapshot_juno_f21
> >     |-- phase1_flow (bf)
> >     |   |-- phase1_test_icehouse_f20
> >     |   `-- phase1_test_juno_f21
> >     `-- phase2_flow (bf)
> >         |-- phase2_test_icehouse_epel6
> >         |-- phase2_test_icehouse_f20
> >         |-- phase2_test_juno_epel7
> >         `-- phase2_test_juno_f21
> 
> As a consumer of CI results, my main problem with this is that it
> takes about 7 clicks to get to the actual error.

+1111
> 
> 
> > When a change comes in from `rdopkg update`, the rdopkg_master_flow job is triggered. It's the only job that gets triggered from gerrit, so it kicks off phase1_flow. phase1_flow runs 'child' jobs (normal jenkins jobs, not buildflow) for each Release,Dist combination present in the update.
> > 
> > provision_and_snapshot is run by manually setting a build parameter (BUILD_SNAPS) in the rdopkg_master_flow job, and triggering the build of rdopkg_master_flow.
> > 
> > phase2 is invoked similarly to the provision_and_snapshot build, by checking 'RUN_PHASE2' in the rdopkg_master_flow build parameters before executing a build thereof.
> > 
> > Concurrency control is a side effect of requiring the user or gerrit to execute rdopkg_master_flow for every action. There can be only one rdopkg_master_flow build executing at any given time.
> > 
> > Complexity, Part 2:
> > -------------------
> > 
> > In addition to the nasty complexity of using nested BuildFlow type jobs, each 'worker' job (i.e. the non-BuildFlow jobs) has some built-in complexity that is reflected in the amount of logic in each job's bash script definition.
> > 
> > Some of this has been alluded to in previous points. For instance, each job in the phase1 flow needs to determine, for each update, if the update contains a package that requires full packstack aio provisioning from a base image (e.g. openstack-puppet-modules). This 'must provision' list needs to be stored somewhere that all jobs can read it, and it needs to be dynamic enough to add to it as requirements dictate.
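For illustration, that shared 'must provision' check might look something like this (the list's contents and location are hypothetical; in practice it would live somewhere all jobs can read, e.g. a file in a shared git repo):

```python
# Sketch: decide whether an update forces a full packstack aio
# provision from a base image instead of reusing a snapshot.
# MUST_PROVISION is a made-up example; the real list would be
# stored centrally and extended as requirements dictate.
MUST_PROVISION = {"openstack-puppet-modules"}

def needs_full_provision(update_packages):
    """update_packages: package names contained in the update."""
    return any(pkg in MUST_PROVISION for pkg in update_packages)
```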
> > 
> > But additionally, for package sets not requiring provisioning from a base image, the phase1 job needs to query the backing OpenStack deployment to see whether a 'known good' snapshot exists, get the images' UUIDs from OpenStack, and spin up the instances from the snapshot images.
> > 
> > This baked-in complexity in the 'worker' jenkins jobs has made it difficult to maintain the job definitions, and more importantly difficult to run using jjb or in other more 'orthodox' CI-type ways. The rdopkg CI stuff is a bastard child of a fork. It lives in its own mutant gene pool.
> 
> lolololol
> 
> 
> > A Way Forward...?
> > -----------------
> > 
> > Wes Hayutin had a good idea that might help reduce some of the complexity here as we contemplate a) making rdopkg CI public, b) moving toward rdopkg CI 0.2.
> > 
> > His idea was a) stop using snapshots since the per-test-run savings doesn't seem to justify the burden they create, b) do away with BuildFlow by including the 'this update contains builds for (Release1,Dist2),...,(ReleaseN,DistM)' information in the gerrit change topic.
> 
> It's easy to modify `rdopkg update` to include this information.
> However, it's redundant so you can (in theory) submit an update where
> this summary won't match the actual YAML data. That's probably
> completely irrelevant, but I'm mentioning it nonetheless :)
> 
I was hoping that we might be able to wrap the submission so that the
topic is generated automatically from the YAML data.
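A wrapper along these lines could derive the topic from the parsed YAML so the two can't drift apart (the update layout here mirrors what I imagine `rdopkg update` emits; the real schema may differ):

```python
# Sketch: generate the gerrit change topic from the parsed update
# YAML, so the topic always matches the actual build list.
# The 'update' structure is illustrative, not rdopkg's schema.

def topic_from_update(update):
    combos = sorted(
        {(b["release"], b["dist"]) for b in update["builds"]}
    )
    return ",".join("%s/%s" % (r, d) for r, d in combos)

update = {
    "builds": [
        {"release": "icehouse", "dist": "fedora-20"},
        {"release": "icehouse", "dist": "epel-6"},
    ],
}
```

Deduplicating and sorting keeps the topic stable regardless of the order builds appear in the YAML.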

> 
> > I think that's a great idea, but I have a superstitious gut feeling that we may lose some 'transaction'y-ness from the current setup. For example, what happens if phase1 and phase2 overlap their execution? It's not that I have evidence that this will be a problem; it's more that we had these issues worked out fairly well with rdopkg CI 0.1, and I think the change warrants some scrutiny/thought (which clearly I have not done!).
> 
> Worth a try if you ask me. I'll gladly help with any scripts/helpers
> needed, so just let me know.
> 
> 
> > We'd still need to work out a way to execute phase2, though. There would be no `rdopkg update` event to trigger phase2 runs. I'm not sure how we'd do that without a BuildFlow. BuildFlow jobs also allow parallelization of the child jobs, and I'm not sure how we could replicate that without using that type of job.
> 
> There can be rdopkg action (i.e. rdopkg update --phase2) to do whatever
> you need if that helps.

So I suppose phase1 would be reduced to a basic packstack job that
runs through with the update from the devel's submission.

Phase 2 addresses what happens when a group of submissions is pooled
together: does one submission break another in a way that could not
be detected by testing each submission individually? I see two
options here:

A. Let the issue get sorted out in the stage yum repo.
B. Create a temporary yum repo from the pooled submissions, with a
job testing against RDO production plus the temporary repo.
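For option B, the job's repo setup could be as simple as dropping a throwaway .repo file that layers the pooled submissions (built into a repo with e.g. createrepo) on top of production. The repo ids and URLs below are made up for illustration:

```python
# Sketch for option B: a yum .repo file pointing the test node at
# RDO production plus a temporary repo of pooled submissions.
# Repo ids and baseurls are illustrative only.

REPO_TEMPLATE = """\
[rdo-production]
name=RDO production
baseurl={prod_url}
enabled=1
gpgcheck=0

[rdo-phase2-tmp]
name=Pooled phase2 submissions
baseurl={tmp_url}
enabled=1
gpgcheck=0
priority=1
"""

def phase2_repo_file(prod_url, tmp_url):
    """Render the .repo file contents for a phase2 test node."""
    return REPO_TEMPLATE.format(prod_url=prod_url, tmp_url=tmp_url)
```

The temporary repo gets a higher priority so the pooled builds win over production where package names overlap.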

Thoughts?

> 
> 
> > Whew. I hope this was helpful. I'm saving a copy of this text to http://slinabery.fedorapeople.org/rdopkg-overview.txt
> 
> Sure is, thanks!
> 
> 
> Cheers,
> Jakub
> 
> _______________________________________________
> Rdo-list mailing list
> Rdo-list at redhat.com
> https://www.redhat.com/mailman/listinfo/rdo-list
> 
> To unsubscribe: rdo-list-unsubscribe at redhat.com




