[rdo-list] Building overcloud images with TripleO and RDO

Fri Aug 12 03:36:09 UTC 2016

Hi,

I spent the last day or two trying to get to the bottom of the issue
described at [1], which turned out to be caused by the version of galera
in EPEL being newer than the one in RDO mitaka stable. When the EPEL
version gets used, mariadb-galera-server fails to start.

In order to understand why epel was being pulled in, how to stop it, and
how this seemed to have slipped through CI/testing, I've been trying to
look through and understand the whole state of the image building
process across TripleO, RDO, and our CI.

Unfortunately what I've discovered hasn't been great. It looks like
there are at least 3 different paths being used to build images.
Apologies if anything below is incorrect; it's incredibly convoluted and
difficult to follow for someone who isn't intimately familiar with it
all (like me).

1) Using "openstack overcloud image build --all", which I assume is the
method end users are supposed to be using, or at least it's the method
documented in the docs. This uses diskimage-builder under the hood, but
the logic behind it is in python (under python-tripleoclient), with a
lot of stuff hardcoded in.

2) Using tripleo.sh, which, while it looks like it calls "openstack
overcloud image build", also has some of its own logic and messes with
things like the ~/.cache/image-create/source-repositories file, which I
believe is how the issue at [1] passed CI in the first place.

3) Using the ansible role ansible-role-tripleo-image-build [2], which
looks like it also uses diskimage-builder, but through a slightly
different approach: an ansible library takes an image definition via
yaml (neat!) and then calls diskimage-builder using
python-tripleo-common as an intermediary. This is a different code path
(though the code itself looks similar) to python-tripleoclient.

I feel this issue is hugely important as I believe it is one of the
biggest barriers to having more people adopt RDO/TripleO. Too often
people encounter issues with deploys that are hard to nail down because
we have no real understanding of exactly how they built the images, and
as an operator I don't feel like I have a clear understanding of what I
get when I use different options. The bug at [1] is a classic example of
something I should never have hit.

We do have stable images available at [3] (built using method 3);
however, there are a number of problems with just using them:

1) I think it's perfectly reasonable for people to want to build their
own images. It's part of the Open Source philosophy: we want things to
be Open and we want to understand how things work, so we can customise,
extend, and troubleshoot ourselves. If your image building process is so
convoluted that you have to say "just use our prebuilt ones", then you
have done something wrong.

2) The images don't get updated (the current mitaka ones were built in
April).

3) Nowhere on the RDO website or the tripleo website is their location
actually referenced. So as a new user, you have exactly zero chance of
finding these images and using them.

I'm not sure what the best process is to start improving this, but it
looks like it's complicated enough and involves enough moving pieces
that a spec against tripleo might be the way to go? I am thinking the
goal would be to move towards everyone having one way, one code path,
for building images with TripleO, that could be utilised by all use
cases out there.

My thinking is the method would take image definitions in a yaml format
similar to how ansible-role-tripleo-image-build works, and we can just
ship a bunch of different yaml files for all the different image
scenarios people might want. e.g.

/usr/share/tripleo-images/centos-7-x86_64-mitaka-cbs.yaml
/usr/share/tripleo-images/centos-7-x86_64-mitaka-trunk.yaml
/usr/share/tripleo-images/centos-7-x86_64-trunk.yaml

And so on. You could then have a symlink called default.yaml which
points to whatever scenario you wish people to use by default, and the
scenario could be overridden by a command line argument. Basically this is
exactly how mock [4] works, and has been proven to be a nice, clean,
easy to use workflow for people to understand. The added bonus is if
people wanted to do their own images, they could copy one of the
existing files as a template to start with.
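
To make the lookup concrete, here is a minimal sketch of how the
scenario selection could work. Nothing here exists today: the
resolve_scenario function and its arguments are made up purely for
illustration, and the directory is just the one from the example paths
above. It only shows the "default.yaml symlink, overridable by name"
idea, not how the image definition itself would be parsed or built.

```python
import os

# Hypothetical location, taken from the example paths in this mail.
DEFAULT_DIR = "/usr/share/tripleo-images"


def resolve_scenario(scenario=None, config_dir=DEFAULT_DIR):
    """Return the real path of the image definition to build from.

    scenario=None means "use default.yaml", which is expected to be a
    symlink pointing at one of the shipped scenario files.
    """
    name = "default" if scenario is None else scenario
    path = os.path.join(config_dir, name + ".yaml")
    if not os.path.exists(path):
        # List the shipped scenarios so the user can see what is on offer.
        available = sorted(
            f[:-5] for f in os.listdir(config_dir)
            if f.endswith(".yaml") and f != "default.yaml"
        )
        raise FileNotFoundError(
            "unknown scenario %r; available: %s" % (name, ", ".join(available))
        )
    # Follow the symlink so logs show which scenario was actually used.
    return os.path.realpath(path)
```

Resolving the symlink up front means whatever gets logged is the real
scenario name rather than "default", which would go some way towards
knowing exactly how someone built their images.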

If people feel this is worthwhile (and I hope they do), I'm interested
in understanding what the next steps would be to get this to happen.

Regards,

Graeme

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1365884
[2] https://github.com/redhat-openstack/ansible-role-tripleo-image-build
[3] http://buildlogs.centos.org/centos/7/cloud/x86_64/tripleo_images/mitaka/cbs/
[4] https://fedoraproject.org/wiki/Mock
-- 
Graeme Gillies
Principal Systems Administrator
Openstack Infrastructure
Red Hat Australia