On Fri, 2015-01-30 at 14:52 +0100, Jakub Ruzicka wrote:
Very nice overview, Steve, thanks for writing this down!
My random thoughts on the matter inline.
On 29.1.2015 22:48, Steve Linabery wrote:
> I have been struggling with the amount of information to convey and what
> level of detail to include. Since I can't seem to get it perfect to my own
> satisfaction, here is the imperfect (and long, sorry) version to begin
> discussion.
>
> This is an overview of where things stand (rdopkg CI 'v0.1').
For some time I've been wondering whether we should really call it rdopkg
CI, since it's not really tied to rdopkg but to RDO. You can use most of
rdopkg on any distgit. I reckon we should simply call it RDO CI to avoid
confusion. I for one don't underestimate the impact of naming stuff ;)
> Terminology:
> 'Release' refers to an OpenStack release (e.g. havana, icehouse, juno)
> 'Dist' refers to a distro supported by RDO (e.g. fedora-20, epel-6, epel-7)
> 'phase1' is the initial smoketest for an update submitted via `rdopkg update`
> 'phase2' is a full-provision test for accumulated updates that have passed phase1
> 'snapshot' means an OpenStack snapshot of a running instance, i.e. a disk
> image created from a running OS instance.
>
> The very broad strokes:
> -----------------------
>
> rdopkg ci is triggered when a packager uses `rdopkg update`.
>
> When a review lands in the rdo-update gerrit project, a 'phase1' smoketest
> is initiated via jenkins for each Release/Dist combination present in the
> update (e.g. if the update contains builds for icehouse/fedora-20 and
> icehouse/epel-6, each set of RPMs from each build will be smoketested on an
> instance running the associated Release/Dist). If *all* supported builds
> from the update pass phase1, then the update is merged into rdo-update.
> Updates that pass phase1 accumulate in the updates/ directory in the
> rdo-update project.
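To make the fan-out concrete, here's a rough sketch in Python. The update
structure and run_smoketest() are hypothetical stand-ins, not the real
`rdopkg update` YAML format or test runner:

```python
# Sketch of phase1 fan-out: one smoketest per (release, dist) build in an
# update, with the update merged only if every smoketest passes.
# The update shape below is illustrative, not the real rdopkg format.

update = {
    "builds": [
        {"release": "icehouse", "dist": "fedora-20"},
        {"release": "icehouse", "dist": "epel-6"},
    ],
}

def run_smoketest(release, dist):
    # Placeholder: in reality this would spin up (or reuse a snapshot of)
    # a packstack aio instance for the combo and run a simple tempest test.
    return True

def phase1(update):
    results = {
        (b["release"], b["dist"]): run_smoketest(b["release"], b["dist"])
        for b in update["builds"]
    }
    # Merge into rdo-update only if *all* builds pass.
    return all(results.values()), results

passed, results = phase1(update)
```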
>
> Periodically, a packager may run 'phase2'. This takes everything in
> updates/ and uses those RPMs + the RDO production repo to provision a set
> of base images with packstack aio. Again, a simple tempest test is run
> against the packstack aio instances. If all pass, then phase2 passes, and
> the `rdopkg update` yaml files are moved from updates/ to ready/.
>
> At that point, someone with the keys to the stage repos will push the
> builds in ready/ to the stage repo. If CI against the stage repo passes,
> stage is rsynced to production.
>
> Complexity, Part 1:
> -------------------
>
> Rdopkg CI v0.1 was designed around the use of OpenStack VM disk snapshots.
> On a periodic basis, we provision two nodes for each supported combination
> in [Releases] X [Dists] (e.g. "icehouse, fedora-20", "juno, epel-7", etc.).
> One node is a packstack aio instance built against RDO production repos,
> and the other is a node running tempest. After a simple tempest test passes
> for all the packstack aio nodes, we would snapshot the set of instances.
> Then when we want to do a 'phase1' test for e.g. "icehouse, fedora-20", we
> can spin up the instances previously snapshotted and save the time of
> re-running packstack aio.
>
> Using snapshots saves approximately 30 min of wait time per test run by
> skipping provisioning. Using snapshots imposes a few substantial
> costs/complexities, though. First and most significant: snapshots need to
> be reinstantiated using the same IP addresses that were present when
> packstack and tempest were run during provisioning. This means we have to
> enforce concurrency control so that only one phase1 run executes at a
> time; otherwise an instance might fail to provision because its 'static'
> IP address is already in use by another run. The second cost is that in
> practice, a) our OpenStack infrastructure has been unreliable, and b) not
> all Release/Dist combinations provision reliably. So it becomes hard to
> create a full set of snapshots.
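As a standalone illustration of the mutual exclusion this requires (in v0.1
it actually falls out of there being a single rdopkg_master_flow job), a
one-at-a-time phase1 guard could be as simple as a non-blocking file lock.
The lock path is an assumption; this is just the idea as a script:

```python
# Hypothetical mutex for phase1 runs: take an exclusive, non-blocking file
# lock so a second run fails fast instead of colliding with the snapshots'
# fixed IP addresses. Linux/Unix only (fcntl).
import fcntl

LOCK_PATH = "/tmp/rdopkg_phase1.lock"  # assumed shared location

def try_acquire_phase1_lock(path=LOCK_PATH):
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f  # keep the handle open for the duration of the run
    except BlockingIOError:
        # Another phase1 run holds the lock.
        f.close()
        return None
```

A caller would hold the returned handle until the run finishes; closing it
releases the lock.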
>
> Additionally, some updates (e.g. when an update comes in for
> openstack-puppet-modules) prevent the use of a previously-provisioned
> packstack instance. Continuing with the o-p-m example: that package is
> used for provisioning itself, so simply updating its RPM after running
> packstack aio doesn't tell us anything about the package's sanity (other
> than perhaps whether a new, unsatisfied RPM dependency was introduced).
>
> Another source of complexity comes from the nature of the rdopkg update
> 'unit'. Each yaml file created by `rdopkg update` can contain multiple
> builds for different Release,Dist combinations. So there must be a way to
> 'collate' the results of each smoketest for each Release,Dist and pass
> phase1 only if all updates pass. Furthermore, some combinations of
> Release,Dist are known (at times, for various ad hoc reasons) to fail
> testing, and those combinations sometimes need to be 'disabled'. For
> example, if we know that icehouse/f20 is 'red' on a given day, we might
> want an update containing icehouse/fedora-20,icehouse/epel-6 to test only
> the icehouse/epel-6 combination and pass if that passes.
>
> Finally, pursuant to the previous point, there need to be 'control'
> structure jobs for provision/snapshot, phase1, and phase2 runs that pass
> (and perform some action upon passing) only when all their 'child' jobs
> have passed.
>
> The way we have managed this complexity to date is through the use of the
> jenkins BuildFlow plugin. Here's some ASCII art (courtesy of 'tree') to
> show how the jobs are structured now (these are descriptive job names, not
> the actual jenkins job names). BuildFlow jobs are indicated by (bf).
>
> .
> `-- rdopkg_master_flow (bf)
> |-- provision_and_snapshot (bf)
> | |-- provision_and_snapshot_icehouse_epel6
> | |-- provision_and_snapshot_icehouse_f20
> | |-- provision_and_snapshot_juno_epel7
> | `-- provision_and_snapshot_juno_f21
> |-- phase1_flow (bf)
> | |-- phase1_test_icehouse_f20
> | `-- phase1_test_juno_f21
> `-- phase2_flow (bf)
> |-- phase2_test_icehouse_epel6
> |-- phase2_test_icehouse_f20
> |-- phase2_test_juno_epel7
> `-- phase2_test_juno_f21
As a consumer of CI results, my main problem with this is that it takes
about 7 clicks to get to the actual error.
+1111
> When a change comes in from `rdopkg update`, the rdopkg_master_flow job is
> triggered. It's the only job that gets triggered from gerrit, so it kicks
> off phase1_flow. phase1_flow runs 'child' jobs (normal jenkins jobs, not
> buildflow) for each Release,Dist combination present in the update.
>
> provision_and_snapshot is run by manually setting a build parameter
> (BUILD_SNAPS) in the rdopkg_master_flow job, and triggering the build of
> rdopkg_master_flow.
>
> phase2 is invoked similarly to the provision_and_snapshot build, by
> checking 'RUN_PHASE2' in the rdopkg_master_flow build parameters before
> executing a build thereof.
>
> Concurrency control is a side effect of requiring the user or gerrit to
> execute rdopkg_master_flow for every action. There can be only one
> rdopkg_master_flow build executing at any given time.
>
> Complexity, Part 2:
> -------------------
>
> In addition to the nasty complexity of using nested BuildFlow type jobs,
> each 'worker' job (i.e. the non-buildflow type jobs) has some built-in
> complexity that is reflected in the amount of logic in each job's bash
> script definition.
>
> Some of this has been alluded to in previous points. For instance, each
> job in the phase1 flow needs to determine, for each update, if the update
> contains a package that requires full packstack aio provisioning from a
> base image (e.g. openstack-puppet-modules). This 'must provision' list
> needs to be stored somewhere that all jobs can read it, and it needs to be
> dynamic enough to add to it as requirements dictate.
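The check itself is tiny once the list is readable from somewhere shared; a
sketch, where the list's storage (a file in git, a jenkins config, ...) and
all names are assumptions:

```python
# Sketch: decide whether an update forces a full packstack aio provision
# from a base image. In reality MUST_PROVISION would be loaded from a
# shared, editable location rather than hard-coded.

MUST_PROVISION = {"openstack-puppet-modules"}  # illustrative contents

def needs_full_provision(update_packages, must_provision=MUST_PROVISION):
    """update_packages: iterable of package names present in the update."""
    return any(pkg in must_provision for pkg in update_packages)
```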
>
> But additionally, for package sets not requiring provisioning from a base
> image, the phase1 job needs to query the backing OpenStack instance to see
> if there exists a 'known good' snapshot, get the images' UUIDs from
> OpenStack, and spin up the instances using the snapshot images.
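The selection step might look like this, with made-up image records standing
in for what a real query against the OpenStack image service (via
novaclient/glanceclient) would return; every field name here is an
assumption:

```python
# Sketch of "known good" snapshot selection for a (release, dist) combo.
# `images` is a list of dicts with hypothetical fields: release, dist,
# known_good, created (sortable timestamp), uuid.

def latest_known_good(images, release, dist):
    """Return the UUID of the newest known-good snapshot, or None."""
    candidates = [img for img in images
                  if img["release"] == release
                  and img["dist"] == dist
                  and img["known_good"]]
    if not candidates:
        # Caller falls back to full packstack aio provisioning.
        return None
    return max(candidates, key=lambda img: img["created"])["uuid"]
```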
>
> This baked-in complexity in the 'worker' jenkins jobs has made it
> difficult to maintain the job definitions, and more importantly difficult
> to run using jjb or in other more 'orthodox' CI-type ways. The rdopkg CI
> stuff is a bastard child of a fork. It lives in its own mutant gene pool.
lolololol
> A Way Forward...?
> ----------------
>
> Wes Hayutin had a good idea that might help reduce some of the complexity
> here as we contemplate a) making rdopkg CI public, b) moving toward rdopkg
> CI 0.2.
>
> His idea was a) stop using snapshots, since the per-test-run savings
> doesn't seem to justify the burden they create, b) do away with BuildFlow
> by including the 'this update contains builds for
> (Release1,Dist1),...,(ReleaseN,DistM)' information in the gerrit change
> topic.
It's easy to modify `rdopkg update` to include this information.
However, it's redundant, so you can (in theory) submit an update where
this summary won't match the actual YAML data. That's probably
completely irrelevant, but I'm mentioning it nonetheless :)
I was hoping that we may be able to wrap the submission such that the
topic is generated automatically from the YAML data.
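Generating the topic from the parsed YAML data would make that mismatch
impossible. A sketch, operating on already-parsed data; the 'builds' shape
and the topic string format are my assumptions:

```python
# Sketch: derive the gerrit topic from the update data itself, so the
# "(Release1,Dist1),...,(ReleaseN,DistM)" summary can never drift from
# the actual YAML contents. Deduplicates and sorts for a stable topic.

def topic_from_update(update):
    """update: parsed YAML data, assumed to contain a 'builds' list of
    dicts with 'release' and 'dist' keys."""
    combos = sorted({(b["release"], b["dist"]) for b in update["builds"]})
    return ",".join("{}/{}".format(r, d) for r, d in combos)
```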
> I think that's a great idea, but I have a superstitious gut feeling that
> we may lose some 'transaction'y-ness from the current setup. For example,
> what happens if phase1 and phase2 overlap their execution? It's not that I
> have evidence that this will be a problem; it's more that we had these
> issues worked out fairly well with rdopkg CI 0.1, and I think the change
> warrants some scrutiny/thought (which clearly I have not done!).
Worth a try if you ask me. I'll gladly help with any scripts/helpers
needed, so just let me know.
> We'd still need to work out a way to execute phase2, though. There would
> be no `rdopkg update` event to trigger phase2 runs. I'm not sure how we'd
> do that without a BuildFlow. BuildFlow jobs also allow parallelization of
> the child jobs, and I'm not sure how we could replicate that without using
> that type of job.
There can be rdopkg action (i.e. rdopkg update --phase2) to do whatever
you need if that helps.
So I suppose Phase 1 would be reduced to a basic packstack job that runs
through with the update from the devel's submission.
Phase 2 addresses what happens when a group of submissions is pooled
together: does one submission break another in a way that could not have
been detected in the individual submissions themselves? I see two options
here:
A. Let the issue get sorted out in the stage yum repo.
B. Create a temporary yum repo with the collection of submissions, and a
job testing against RDO production plus the temporary repo.
Thoughts?
> Whew. I hope this was helpful. I'm saving a copy of this text to
> http://slinabery.fedorapeople.org/rdopkg-overview.txt
Sure is, thanks!
Cheers,
Jakub
_______________________________________________
Rdo-list mailing list
Rdo-list(a)redhat.com
https://www.redhat.com/mailman/listinfo/rdo-list
To unsubscribe: rdo-list-unsubscribe(a)redhat.com