On 5.2.2015 16:49, Steve Linabery wrote:
On Fri, Jan 30, 2015 at 08:53:25AM -0600, Steve Linabery wrote:
> On Fri, Jan 30, 2015 at 02:52:55PM +0100, Jakub Ruzicka wrote:
>> Very nice overview Steve, thanks for writing this down!
>>
>> My random thoughts on the matter inline.
>>
>> On 29.1.2015 22:48, Steve Linabery wrote:
>>> I have been struggling with the amount of information to convey and what
>>> level of detail to include. Since I can't seem to get it perfect to my own
>>> satisfaction, here is the imperfect (and long, sorry) version to begin discussion.
>>>
>>> This is an overview of where things stand (rdopkg CI 'v0.1').
>>
>> For some time I've been wondering if we should really call it rdopkg CI
>> since it's not really tied to rdopkg but to RDO. You can use most of rdopkg
>> on any distgit. I reckon we should simply call it RDO CI to avoid
>> confusion. I for one don't underestimate the impact of naming stuff ;)
>>
>>> Terminology:
>>> 'Release' refers to an OpenStack release (e.g. havana, icehouse, juno)
>>> 'Dist' refers to a distro supported by RDO (e.g. fedora-20, epel-6, epel-7)
>>> 'phase1' is the initial smoketest for an update submitted via `rdopkg update`
>>> 'phase2' is a full-provision test for accumulated updates that have passed phase1
>>> 'snapshot' means an OpenStack snapshot of a running instance, i.e. a disk image created from a running OS instance.
>>>
>>> The very broad strokes:
>>> -----------------------
>>>
>>> rdopkg CI is triggered when a packager uses `rdopkg update`.
>>>
>>> When a review lands in the rdo-update gerrit project, a 'phase1' smoketest
>>> is initiated via jenkins for each Release/Dist combination present in the
>>> update (e.g. if the update contains builds for icehouse/fedora-20 and
>>> icehouse/epel-6, each set of RPMs from each build will be smoketested on an
>>> instance running the associated Release/Dist). If *all* supported builds
>>> from the update pass phase1, then the update is merged into rdo-update.
>>> Updates that pass phase1 accumulate in the updates/ directory in the
>>> rdo-update project.
>>>
>>> Periodically, a packager may run 'phase2'. This takes everything in
>>> updates/ and uses those RPMs + RDO production repo to provision a set of
>>> base images with packstack aio. Again, a simple tempest test is run against
>>> the packstack aio instances. If all pass, then phase2 passes, and the
>>> `rdopkg update` yaml files are moved from updates/ to ready/.
>>>
>>> At that point, someone with the keys to the stage repos will push the
>>> builds in ready/ to the stage repo. If CI against the stage repo passes,
>>> stage is rsynced to production.
>>>
>>> Complexity, Part 1:
>>> -------------------
>>>
>>> Rdopkg CI v0.1 was designed around the use of OpenStack VM disk snapshots.
>>> On a periodic basis, we provision two nodes for each supported combination
>>> in [Releases] x [Dists] (e.g. "icehouse, fedora-20", "juno, epel-7", etc.).
>>> One node is a packstack aio instance built against RDO production repos,
>>> and the other is a node running tempest. After a simple tempest test passes
>>> for all the packstack aio nodes, we snapshot the set of instances. Then,
>>> when we want to do a 'phase1' test for e.g. "icehouse, fedora-20", we can
>>> spin up the previously snapshotted instances and save the time of
>>> re-running packstack aio.
>>>
>>> Using snapshots saves approximately 30 min of wait time per test run by
>>> skipping provisioning. However, snapshots impose a few substantial
>>> costs/complexities. First and most significant, snapshots need to be
>>> reinstantiated using the same IP addresses that were present when packstack
>>> and tempest were run during provisioning. This means we need concurrency
>>> control to allow only one phase1 run at a time; otherwise an instance might
>>> fail to provision because its 'static' IP address is already in use by
>>> another run. The second cost is that, in practice, a) our OpenStack
>>> infrastructure has been unreliable, and b) not all Release/Dist
>>> combinations provision reliably. So it becomes hard to create a full set
>>> of snapshots.
>>>
>>> Additionally, some updates (e.g. when an update comes in for
>>> openstack-puppet-modules) prevent the use of a previously-provisioned
>>> packstack instance. Continuing with the o-p-m example: that package is
>>> itself used for provisioning, so simply updating its RPM after running
>>> packstack aio doesn't tell us anything about the package's sanity (other
>>> than perhaps whether a new, unsatisfied RPM dependency was introduced).
>>>
>>> Another source of complexity comes from the nature of the rdopkg update
>>> 'unit'. Each yaml file created by `rdopkg update` can contain multiple
>>> builds for different Release,Dist combinations. So there must be a way to
>>> 'collate' the results of each smoketest for each Release,Dist and pass
>>> phase1 only if all of the builds pass. Furthermore, some combinations of
>>> Release,Dist are known (at times, for various ad hoc reasons) to fail
>>> testing, and those combinations sometimes need to be 'disabled'. For
>>> example, if we know that icehouse/fedora-20 is 'red' on a given day, we
>>> might want an update containing icehouse/fedora-20 and icehouse/epel-6 to
>>> test only the icehouse/epel-6 combination and pass if that passes.
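>>>
>>> In rough python, the collation looks something like this (a sketch only;
>>> the names are made up, and the real logic lives in the jobs' bash scripts):
>>>
>>> # Minimal sketch of the phase1 collation logic described above.
>>> DISABLED = {("icehouse", "fedora-20")}  # combos known 'red' today
>>>
>>> def phase1_passes(results):
>>>     """results: dict mapping (release, dist) -> smoketest outcome (bool).
>>>     Pass overall only if every non-disabled combination passed."""
>>>     enabled = {combo: ok for combo, ok in results.items()
>>>                if combo not in DISABLED}
>>>     return bool(enabled) and all(enabled.values())
>>>
>>> # An update with icehouse/fedora-20 (disabled) and icehouse/epel-6:
>>> phase1_passes({("icehouse", "fedora-20"): False,
>>>                ("icehouse", "epel-6"): True})   # -> True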
>>>
>>> Finally, pursuant to the previous point, there need to be 'control'
>>> structure jobs for provision/snapshot, phase1, and phase2 runs that pass
>>> (and perform some action upon passing) only when all their 'child' jobs
>>> have passed.
>>>
>>> The way we have managed this complexity to date is through the use of the
>>> jenkins BuildFlow plugin. Here's some ASCII art (courtesy of 'tree') to
>>> show how the jobs are structured now (these are descriptive job names, not
>>> the actual jenkins job names). BuildFlow jobs are indicated by (bf).
>>>
>>> .
>>> `-- rdopkg_master_flow (bf)
>>> |-- provision_and_snapshot (bf)
>>> | |-- provision_and_snapshot_icehouse_epel6
>>> | |-- provision_and_snapshot_icehouse_f20
>>> | |-- provision_and_snapshot_juno_epel7
>>> | `-- provision_and_snapshot_juno_f21
>>> |-- phase1_flow (bf)
>>> | |-- phase1_test_icehouse_f20
>>> | `-- phase1_test_juno_f21
>>> `-- phase2_flow (bf)
>>> |-- phase2_test_icehouse_epel6
>>> |-- phase2_test_icehouse_f20
>>> |-- phase2_test_juno_epel7
>>> `-- phase2_test_juno_f21
>>
>> As a consumer of CI results, my main problem with this is that it takes
>> about 7 clicks to get to the actual error.
>>
>>
>>> When a change comes in from `rdopkg update`, the rdopkg_master_flow job is
>>> triggered. It's the only job that gets triggered from gerrit, so it kicks
>>> off phase1_flow. phase1_flow runs 'child' jobs (normal jenkins jobs, not
>>> BuildFlow) for each Release,Dist combination present in the update.
>>>
>>> provision_and_snapshot is run by manually setting a build parameter
>>> (BUILD_SNAPS) on the rdopkg_master_flow job and triggering a build of
>>> rdopkg_master_flow.
>>>
>>> phase2 is invoked similarly to the provision_and_snapshot build, by
>>> checking 'RUN_PHASE2' in the rdopkg_master_flow build parameters before
>>> executing a build thereof.
>>>
>>> Concurrency control is a side effect of requiring the user or gerrit to
>>> execute rdopkg_master_flow for every action. There can be only one
>>> rdopkg_master_flow build executing at any given time.
>>>
>>> Complexity, Part 2:
>>> -------------------
>>>
>>> In addition to the nasty complexity of using nested BuildFlow-type jobs,
>>> each 'worker' job (i.e. the non-BuildFlow jobs) has some built-in
>>> complexity that is reflected in the amount of logic in each job's bash
>>> script definition.
>>>
>>> Some of this has been alluded to in previous points. For instance, each job
>>> in the phase1 flow needs to determine, for each update, whether the update
>>> contains a package that requires full packstack aio provisioning from a
>>> base image (e.g. openstack-puppet-modules). This 'must provision' list
>>> needs to be stored somewhere that all jobs can read it, and it needs to be
>>> dynamic enough to add to as requirements dictate.
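>>>
>>> Something like this sketch (the shared location and file name here are
>>> hypothetical, not what the jobs actually use):
>>>
>>> import json, urllib2
>>>
>>> # Hypothetical shared location all jobs can read; trivial to append to.
>>> MUST_PROVISION_URL = "http://ci-config.example.com/must_provision.json"
>>>
>>> def needs_full_provision(update_packages):
>>>     """True if any package in the update (e.g. openstack-puppet-modules)
>>>     appears on the shared 'must provision' list."""
>>>     must = set(json.load(urllib2.urlopen(MUST_PROVISION_URL)))
>>>     return bool(must & set(update_packages))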
>>>
>>> But additionally, for package sets not requiring provisioning from a base
>>> image, the phase1 job needs to query the backing OpenStack instance to see
>>> whether a 'known good' snapshot exists, get the images' UUIDs from
>>> OpenStack, and spin up instances from the snapshot images.
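>>>
>>> Roughly like this (novaclient calls as I remember them; the credentials
>>> and the snapshot naming convention below are assumptions, not our actual
>>> setup):
>>>
>>> from novaclient import client
>>>
>>> nova = client.Client("2", "ci-user", "secret", "rdo-ci",
>>>                      "http://keystone.example.com:5000/v2.0")
>>>
>>> def boot_from_snapshot(release, dist, role):
>>>     """Boot an instance from the 'known good' snapshot image for e.g.
>>>     role='packstack', release='icehouse', dist='fedora-20'."""
>>>     snap_name = "snap-%s-%s-%s" % (role, release, dist)  # assumed convention
>>>     image = nova.images.find(name=snap_name)   # raises NotFound if missing
>>>     flavor = nova.flavors.find(name="m1.large")
>>>     return nova.servers.create(name="%s-%s-%s" % (role, release, dist),
>>>                                image=image.id, flavor=flavor.id)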
>>>
>>> This baked-in complexity in the 'worker' jenkins jobs has made it difficult
>>> to maintain the job definitions, and more importantly difficult to run them
>>> using jjb or in other more 'orthodox' CI-type ways. The rdopkg CI stuff is
>>> a bastard child of a fork. It lives in its own mutant gene pool.
>>
>> lolololol
>>
>>
>>> A Way Forward...?
>>> -----------------
>>>
>>> Wes Hayutin had a good idea that might help reduce some of the complexity
>>> here as we contemplate a) making rdopkg CI public, and b) moving toward
>>> rdopkg CI 0.2.
>>>
>>> His idea was to a) stop using snapshots, since the per-test-run savings
>>> don't seem to justify the burden they create, and b) do away with BuildFlow
>>> by including the 'this update contains builds for
>>> (Release1,Dist2),...,(ReleaseN,DistM)' information in the gerrit change
>>> topic.
>>
>> It's easy to modify `rdopkg update` to include this information.
>> However, it's redundant, so you can (in theory) submit an update where
>> this summary won't match the actual YAML data. That's probably
>> completely irrelevant, but I'm mentioning it nonetheless :)
>>
>
> This is less a response to Jakub's comment here and more an additional
> explanation of why this idea is so nice.
>
> Currently, when phase1_flow is triggered, it ssh's to a separate host to run
> `rdoupdate check` (because BuildFlow jobs execute on the jenkins master node,
> disregarding any setting to run them on a particular slave) and parses the
> output to determine which Release,Dist combinations need to be tested.
>
> The gerrit topic approach would allow us to have N jobs listening for gerrit
> trigger events, where e.g. the juno/epel-7 job would only execute if the
> gerrit topic matched that job's regexp. The gerrit review would only get its
> +2 when all these jobs complete successfully.
>
> It would be nice to decide what that topic string ought to look like so that
> a) we keep the regexps sane, and b) we are sure gerrit will support long
> strings with whatever odd chars we may wish to use, etc.
>
I'll propose a format for the gerrit topic string. Let's say a build includes
updates for:

  icehouse,fedora-20
  juno,fedora-21
  juno,epel-7

The resulting topic string would be:

  icehouse_fedora-20/juno_fedora-21_epel-7

That is, releases are separated by '/', and each release is followed by its
dists, joined with '_'. A job triggered off gerrit would then have a regex
like '.*$release[^/]+$dist.*'. So, for example, the icehouse/fedora-20 phase1
job would use '.*icehouse[^/]+fedora-20.*', which matches the topic above, so
that test would run.
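
To illustrate both sides in python (a sketch only; the function name is mine,
not rdopkg's):

  import re

  def make_topic(builds):
      """builds: list of (release, dist) tuples from the update."""
      by_release = {}
      for release, dist in builds:
          by_release.setdefault(release, []).append(dist)
      return "/".join(release + "_" + "_".join(dists)
                      for release, dists in sorted(by_release.items()))

  topic = make_topic([("icehouse", "fedora-20"), ("juno", "fedora-21"),
                      ("juno", "epel-7")])
  assert topic == "icehouse_fedora-20/juno_fedora-21_epel-7"
  # the icehouse/fedora-20 job's trigger regex matches, so that job runs:
  assert re.match(r".*icehouse[^/]+fedora-20.*", topic)
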
Jakub, could we have `rdopkg update` generate the topic string as indicated,
based on the contents of the update?

Hey,

I just pushed the requested change to rdopkg. It will be included in the
next rdopkg release (0.25).

https://github.com/redhat-openstack/rdopkg/commit/09bd0dfc9ed928123e94b07...

Cheers,
Jakub