I have been struggling with the amount of information to convey and what level of detail
to include. Since I can't seem to get it perfect to my own satisfaction, here is the
imperfect (and long, sorry) version to begin discussion.
This is an overview of where things stand (rdopkg CI 'v0.1').
Terminology:
'Release' refers to an OpenStack release (e.g. havana, icehouse, juno)
'Dist' refers to a distro supported by RDO (e.g. fedora-20, epel-6, epel-7)
'phase1' is the initial smoketest for an update submitted via `rdopkg update`
'phase2' is a full-provision test for accumulated updates that have passed phase1
'snapshot' means an OpenStack snapshot of a running instance, i.e. a disk image
created from a running OS instance.
The very broad strokes:
-----------------------
rdopkg CI is triggered when a packager uses `rdopkg update`.
When a review lands in the rdo-update gerrit project, a 'phase1' smoketest is
initiated via jenkins for each Release/Dist combination present in the update (e.g. if the
update contains builds for icehouse/fedora-20 and icehouse/epel-6, each set of RPMs from
each build will be smoketested on an instance running the associated Release/Dist). If
*all* supported builds from the update pass phase1, then the update is merged into
rdo-update. Updates that pass phase1 accumulate in the updates/ directory in the
rdo-update project.
Periodically, a packager may run 'phase2'. This takes everything in updates/ and
uses those RPMs + the RDO production repo to provision a set of base images with packstack
aio (all-in-one). As in phase1, a simple tempest test is run against the packstack aio
instances. If all pass, then phase2 passes, and the `rdopkg update` yaml files are moved
from updates/ to ready/.
At that point, someone with the keys to the stage repos will push the builds in ready/ to
the stage repo. If CI against stage repo passes, stage is rsynced to production.
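Under the assumption that updates are plain yaml files sitting in updates/ and promotion is a simple move to ready/ (the directory names come from above; the function name and the .yml extension are my guesses), the bookkeeping at the end of a passing phase2 might look like:

```shell
#!/bin/sh
# Hypothetical sketch: after phase2 passes, move every pending update
# file from updates/ to ready/. Directory layout is as described above;
# everything else here is invented for illustration.
promote_updates() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    for f in "$src"/*.yml; do
        [ -e "$f" ] || continue   # glob didn't match: nothing pending
        mv "$f" "$dst"/
    done
}
```

The real step also involves someone with the stage-repo keys actually pushing the builds; this only covers the yaml shuffle.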
Complexity, Part 1:
-------------------
Rdopkg CI v0.1 was designed around the use of OpenStack VM disk snapshots. On a periodic
basis, we provision two nodes for each supported combination in [Releases] X [Dists] (e.g.
"icehouse, fedora-20" "juno, epel-7" etc). One node is a packstack aio
instance built against RDO production repos, and the other is a node running tempest.
After a simple tempest test passes for all the packstack aio nodes, we snapshot the set
of instances. Then, when we want to do a 'phase1' test for e.g. "icehouse, fedora-20",
we can spin up the previously snapshotted instances and save the time of re-running
packstack aio.
Using snapshots saves approximately 30 min of wait time per test run by skipping
provisioning. However, using snapshots imposes a few substantial costs/complexities. First
and most significant, snapshots need to be reinstantiated using the same IP addresses that
were present when packstack and tempest ran during provisioning. This means we need
concurrency control so that only one phase1 run executes at a time; otherwise an
instance might fail to provision because its 'static' IP address is already in use by
another run. The second cost is that, in practice, a) our OpenStack infrastructure
has been unreliable, b) not all Release/Dist combinations reliably provision. So it
becomes hard to create a full set of snapshots reliably.
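For what it's worth, the 'one phase1 at a time' constraint is classic mutual exclusion; outside of jenkins it could be expressed with flock(1). A minimal sketch (the lock path and the echo placeholder are made up; only the flock pattern is the point):

```shell
#!/bin/sh
# Sketch: serialize phase1 runs with flock(1) so two runs never fight
# over the snapshots' fixed IP addresses. Assumes util-linux flock.
LOCK=${LOCK:-/tmp/rdopkg-phase1.lock}

run_phase1() {
    (
        # -n: fail immediately if another run holds the lock, rather than queueing
        flock -n 9 || { echo "another phase1 run is active"; exit 1; }
        echo "running phase1 for $1"   # stand-in for the real smoketest
    ) 9>"$LOCK"
}
```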
Additionally, some updates (e.g. when an update comes in for openstack-puppet-modules)
prevent the use of a previously-provisioned packstack instance. Continuing with the o-p-m
example: that package is used for provisioning. So simply updating the RPM for that
package after running packstack aio doesn't tell us anything about the package sanity
(other than perhaps if a new, unsatisfied RPM dependency was introduced).
Another source of complexity comes from the nature of the rdopkg update 'unit'.
Each yaml file created by `rdopkg update` can contain multiple builds for different
Release,Dist combinations. So there must be a way to 'collate' the results of each
smoketest for each Release,Dist and pass phase1 only if all updates pass. Furthermore,
some combinations of Release,Dist are known (at times, for various ad hoc reasons) to fail
testing, and those combinations sometimes need to be 'disabled'. For example, if
we know that icehouse/f20 is 'red' on a given day, we might want an update
containing icehouse/fedora-20,icehouse/epel-6 to test only the icehouse/epel-6 combination
and pass if that passes.
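The collation rule ('pass only if every non-disabled combination passed') is simple to state; here is a shell sketch, with made-up formats for the results and disabled lists:

```shell
#!/bin/sh
# Sketch of phase1 collation. The input formats are invented:
#   results  - space-separated "release/dist=pass|fail" pairs
#   disabled - space-separated combinations to ignore entirely
collate_phase1() {
    results=$1
    disabled=$2
    for r in $results; do
        combo=${r%=*}
        status=${r#*=}
        case " $disabled " in
            *" $combo "*) continue ;;   # combination is 'red' today: skip it
        esac
        [ "$status" = pass ] || return 1
    done
    return 0
}
```

With icehouse/fedora-20 disabled, an update whose fedora-20 smoketest failed can still pass on the strength of its epel-6 result: `collate_phase1 'icehouse/fedora-20=fail icehouse/epel-6=pass' 'icehouse/fedora-20'` returns 0.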
Finally, pursuant to the previous point, there need to be 'control' structure jobs
for provision/snapshot, phase1, and phase2 runs that pass (and perform some action upon
passing) only when all their 'child' jobs have passed.
The way we have managed this complexity to date is through the use of the jenkins
BuildFlow plugin. Here's some ASCII art (courtesy of 'tree') to show how the
jobs are structured now (these are descriptive job names, not the actual jenkins job
names). BuildFlow jobs are indicated by (bf).
.
`-- rdopkg_master_flow (bf)
|-- provision_and_snapshot (bf)
| |-- provision_and_snapshot_icehouse_epel6
| |-- provision_and_snapshot_icehouse_f20
| |-- provision_and_snapshot_juno_epel7
| `-- provision_and_snapshot_juno_f21
|-- phase1_flow (bf)
| |-- phase1_test_icehouse_f20
| `-- phase1_test_juno_f21
`-- phase2_flow (bf)
|-- phase2_test_icehouse_epel6
|-- phase2_test_icehouse_f20
|-- phase2_test_juno_epel7
`-- phase2_test_juno_f21
When a change comes in from `rdopkg update`, the rdopkg_master_flow job is triggered.
It's the only job that gets triggered from gerrit, so it kicks off phase1_flow.
phase1_flow runs 'child' jobs (normal jenkins jobs, not buildflow) for each
Release,Dist combination present in the update.
provision_and_snapshot is run by manually setting a build parameter (BUILD_SNAPS) on the
rdopkg_master_flow job and triggering a build of rdopkg_master_flow.
phase2 is invoked similarly to the provision_and_snapshot build: by checking
'RUN_PHASE2' in the rdopkg_master_flow build parameters before executing a build.
Concurrency control is a side effect of requiring the user or gerrit to execute
rdopkg_master_flow for every action. There can be only one rdopkg_master_flow build
executing at any given time.
Complexity, Part 2:
-------------------
In addition to the nasty complexity of using nested BuildFlow-type jobs, each
'worker' job (i.e. each non-BuildFlow job) has some built-in complexity that
is reflected in the amount of logic in its bash script definition.
Some of this has been alluded to in previous points. For instance, each job in the phase1
flow needs to determine whether the update contains a package that requires full
packstack aio provisioning from a base image (e.g. openstack-puppet-modules). This
'must provision' list needs to be stored somewhere that all jobs can read, and it
needs to be easy to extend as requirements dictate.
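The check itself is just set membership; a sketch, with a made-up storage format (in reality the list would live somewhere shared, e.g. a file every job can fetch):

```shell
#!/bin/sh
# Sketch: decide whether an update forces a full packstack aio provision
# from a base image. The list contents and keeping them in a variable are
# assumptions; openstack-puppet-modules is the example from the text.
MUST_PROVISION="openstack-puppet-modules"

needs_full_provision() {
    # $1: space-separated package names contained in the update
    for pkg in $1; do
        case " $MUST_PROVISION " in
            *" $pkg "*) return 0 ;;
        esac
    done
    return 1
}
```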
But additionally, for package sets that do not require provisioning from a base image,
the phase1 job needs to query the backing OpenStack instance to see whether a 'known
good' snapshot exists, get the images' UUIDs from OpenStack, and spin up the instances
from the snapshot images.
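That lookup could at least be pulled out of the job bodies into a small helper. A sketch that assumes snapshots are named by a convention like snap_<release>_<dist> and that we have a "name uuid" listing from the cloud in hand (both the naming convention and the listing format are invented):

```shell
#!/bin/sh
# Sketch: find the UUID of the known-good snapshot for a Release/Dist
# combination from a listing of "name uuid" lines (e.g. the output of an
# image-list command). snap_<release>_<dist> naming is an assumption.
snapshot_uuid() {
    release=$1
    dist=$2
    listing=$3   # lines of "name uuid"
    printf '%s\n' "$listing" |
        awk -v n="snap_${release}_${dist}" '$1 == n { print $2; exit }'
}
```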
This baked-in complexity in the 'worker' jenkins jobs has made it difficult to
maintain the job definitions, and more importantly difficult to run using jjb or in other
more 'orthodox' CI-type ways. The rdopkg CI stuff is a bastard child of a fork. It
lives in its own mutant gene pool.
A Way Forward...?
-----------------
Wes Hayutin had a good idea that might help reduce some of the complexity here as we
contemplate a) making rdopkg CI public, b) moving toward rdopkg CI 0.2.
His idea was to a) stop using snapshots, since the per-test-run savings don't seem to
justify the burden they create, and b) do away with BuildFlow by including the 'this
update contains builds for (Release1,Dist2),...,(ReleaseN,DistM)' information in the
gerrit change topic.
I think that's a great idea, but I have a superstitious gut feeling that we may lose
some 'transaction'y-ness from the current setup. For example, what happens if
phase1 and phase2 overlap their execution? It's not that I have evidence that this
will be a problem; it's more that we had these issues worked out fairly well with
rdopkg CI 0.1, and I think the change warrants some scrutiny/thought (which clearly I have
not done!).
We'd still need to work out a way to execute phase2, though. There would be no `rdopkg
update` event to trigger phase2 runs. I'm not sure how we'd do that without a
BuildFlow. BuildFlow jobs also allow parallelization of the child jobs, and I'm not
sure how we could replicate that without using that type of job.
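On the parallelization point: if the children were plain scripts rather than jenkins jobs, backgrounding plus wait would give the same fan-out/fan-in. A sketch (run_combo_test is a made-up stand-in for 'trigger the child job for this combo and poll it'):

```shell
#!/bin/sh
# Sketch: BuildFlow-style parallel children in plain shell. Launch one
# test per Release/Dist combination, then fail if any child failed.
run_combo_test() {
    # Stand-in for triggering the real per-combo test
    echo "testing $1"
}

run_parallel() {
    fail=0
    pids=""
    for combo in "$@"; do
        run_combo_test "$combo" &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || fail=1    # collect each child's exit status
    done
    return $fail
}
```

This loses what jenkins gives for free (per-child console logs, retries, the UI), so it's an illustration of the control flow, not a proposal to drop jenkins.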
Whew. I hope this was helpful. I'm saving a copy of this text to
http://slinabery.fedorapeople.org/rdopkg-overview.txt
Cheers,
Steve Linabery (freenode: eggmaster)
Senior Software Engineer, Red Hat, Inc.