[rdo-list] [TripleO] Newton large baremetal deployment issues

Thu Nov 10 03:13:44 UTC 2016

On 10/11/16 02:18, Charles Short wrote:
> Hi,
> 
> Just some feedback on this thread.
> 
> I have redeployed several times and have begun to suspect DNS as being
> the cause for delays (just a guess as the deployment always completes
> with no obvious errors).
> I had a look at the local hosts files on the nodes during deployment and
> can see that lots of them (not all) are incorrectly formatted as they
> contain '\n'.
> 
> For example a small part of one hosts file -
> <<
> \n10.0.7.30 overcloud-novacompute-32.localdomain overcloud-novacompute-32
> 192.168.0.39 overcloud-novacompute-32.external.localdomain
> overcloud-novacompute-32.external
> 10.0.7.30 overcloud-novacompute-32.internalapi.localdomain
> overcloud-novacompute-32.internalapi
> 10.35.5.67 overcloud-novacompute-32.storage.localdomain
> overcloud-novacompute-32.storage
> 192.168.0.39 overcloud-novacompute-32.storagemgmt.localdomain
> overcloud-novacompute-32.storagemgmt
> 10.0.8.39 overcloud-novacompute-32.tenant.localdomain
> overcloud-novacompute-32.tenant
> 192.168.0.39 overcloud-novacompute-32.management.localdomain
> overcloud-novacompute-32.management
> 192.168.0.39 overcloud-novacompute-32.ctlplane.localdomain
> overcloud-novacompute-32.ctlplane
> \n10.0.7.21 overcloud-novacompute-33.localdomain overcloud-novacompute-33
>>>
> 
> I wondered if maybe the image I was using was the issue so I tried the
> RH OSP9 official image -  Same hosts file formatting issues in deployment.
> Maybe a workaround would be to change nsswitch.conf in the image to look
> up from DNS first  -  my Undercloud dnsmasq server - and have this
> populated with the correct entries from a node (once all nodes are
> pingable).
> 
> Charles

Hi Charles,

If you are getting formatting issues in /etc/hosts, it's possible that
the templates directory you are using has problems, especially if
it's been edited on Windows machines. Are you using unmodified templates
from /usr/share/openstack-tripleo-heat-templates? Also note that RHOS 9
images will not match RDO Newton templates, as RHOS 9 is Mitaka, and
overcloud images contain puppet modules which must stay in sync with the
templates used on the undercloud.
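
One quick way to test the Windows-editing theory is to look for CRLF
line endings in whatever templates directory the deploy is using. The
following is only a rough sketch: the function name find_crlf_files is
made up for illustration, and it assumes that a CRLF anywhere in a file
is a good enough sign that the file was saved on Windows.

```python
import os


def find_crlf_files(root):
    """Return a sorted list of files under root that contain CRLF line endings."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read as bytes so no newline translation hides the \r\n pairs.
                with open(path, "rb") as fh:
                    if b"\r\n" in fh.read():
                        hits.append(path)
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
    return sorted(hits)
```

Calling find_crlf_files("/usr/share/openstack-tripleo-heat-templates")
(or your own copy of the templates) and getting an empty list back would
suggest line endings are not the problem.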

If you are using the templates in
/usr/share/openstack-tripleo-heat-templates, can you give the output (if
any) from

rpm -V openstack-tripleo-heat-templates

Also perhaps getting a copy of your full overcloud deploy command will
help shed some light as well.
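
To get a feel for how many nodes are affected, the hosts file from each
node could be checked for the literal two-character sequence '\n' that
you describe. This is a minimal sketch, not anything TripleO ships: the
function name find_literal_newlines is hypothetical, and it assumes you
have already fetched the contents of /etc/hosts from a node as a string.

```python
def find_literal_newlines(hosts_text):
    """Return (line_number, line) pairs for lines holding a literal backslash-n."""
    bad = []
    for lineno, line in enumerate(hosts_text.splitlines(), start=1):
        # "\\n" is the two characters backslash and n, not a real newline.
        if "\\n" in line:
            bad.append((lineno, line))
    return bad
```

An empty result for a node means its hosts file has no such entries;
comparing the results across nodes should show whether the bad entries
line up with particular nodes or node types.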

Thanks in advance,

Graeme

> 
> On 06/11/2016 23:25, Graeme Gillies wrote:
>> Hi Charles,
>>
>> This definitely looks a bit strange to me, as we do deploys of around 42
>> nodes and it takes around 2 hours to do so, on a setup similar to yours
>> (1G link for provisioning, bonded 10G for everything else).
>>
>> Would it be possible for you to run an sosreport on your undercloud and
>> provide it somewhere (if you are comfortable doing so)? Also, can you
>> show us the output of
>>
>> openstack stack list --nested
>>
>> And most importantly, if we can get a full copy of the output of the
>> overcloud deploy command, with timestamps showing when every stack is
>> created/finished, we can try to narrow down where all the time is
>> being spent.
>>
>> You note that you have quite a powerful undercloud (294GB of memory and
>> 64 CPUs), and we have had issues in the past with very powerful
>> underclouds, because the OpenStack components try to tune themselves
>> around the hardware they are running on and get it wrong for bigger
>> servers.
>>
>> Are we able to get an output from "sar" or some other tool you are using
>> to track cpu and memory usage during the deployment? I'd like to check
>> those values look sane.
>>
>> Thanks in advance,
>>
>> Graeme
>>
>> On 05/11/16 01:31, Charles Short wrote:
>>> Hi,
>>>
>>> Each node has 2X HP 900GB 12G SAS 10K 2.5in SC ENT HDD.
>>> The 1Gb deployment NIC is not really causing the delay. It is very busy
>>> for the time the overcloud image is rolled out (the first 30 to 45 mins
>>> of deployment), but after that (once all the nodes are up and active
>>> with an IP address (pingable)), the bandwidth is a fraction of 1Gbps on
>>> average for the rest of the deployment. For info the NICs in the nodes
>>> for the Overcloud networks are dual bonded 10Gbit.
>>>
>>> The deployment I mentioned before (50 nodes) actually completed in 8
>>> hours (which is double the time it took for 35 nodes!)
>>>
>>> I am in the process of a new 3 controller 59 compute node deployment,
>>> pinning all the nodes as you suggested. The initial overcloud image roll
>>> out took just under 1 hour (all nodes ACTIVE and pingable). I am now 4.5
>>> hours in and all is running (slowly). It is currently on Step 2 (of 5
>>> steps). I would expect this deployment to take 10 hours at the current
>>> speed.
>>>
>>> Regards
>>>
>>> Charles
>>>
>>> On 04/11/2016 15:17, Justin Kilpatrick wrote:
>>>> Hey Charles,
>>>>
>>>> What sort of issues are you seeing now? How did node pinning work out
>>>> and did a slow scale up present any more problems?
>>>>
>>>> Deployments tend to be disk and network limited. You don't mention
>>>> what sort of disks your machines have, but you do note 1G NICs, which
>>>> are doable but might require some timeout adjustments or other
>>>> considerations to give everything time to complete.
>>>>
>>>> On Fri, Nov 4, 2016 at 10:45 AM, Charles Short <cems at ebi.ac.uk> wrote:
>>>>
>>>>      Hi,
>>>>
>>>>      So you are implying that TripleO is not really currently able to
>>>>      roll out large deployments easily as it is prone to scaling
>>>>      delays/errors?
>>>>      Is the same true for RH OSP9 (out of the box) as this also uses
>>>>      TripleO? I would expect exactly the same scaling issues. But
>>>>      surely OSP9 is designed for large enterprise OpenStack
>>>>      installations?
>>>>      So if OSP9 does work well with large deployments, what are the
>>>>      TripleO tweaks that make this work (if any)?
>>>>
>>>>      Many Thanks
>>>>
>>>>      Charles
>>>>
>>>>      On 03/11/2016 13:30, Justin Kilpatrick wrote:
>>>>>      Hey Charles,
>>>>>
>>>>>      If you want to deploy a large number of machines, I suggest you
>>>>>      deploy a small configuration (maybe 3 controllers 1 compute) and
>>>>>      then run the overcloud deploy command again with 2 computes, and
>>>>>      so on until you reach your full allocation.
>>>>>
>>>>>      Realistically you can probably do a stride of 5 computes each
>>>>>      time; experiment with it a bit. As you get up to the full
>>>>>      allocation of nodes you might run into a race condition bug with
>>>>>      assigning computes to nodes and need to pin nodes (pinning is
>>>>>      adding an ironic property saying that overcloud-novacompute-0 goes
>>>>>      here, 1 here, and so on).
>>>>>
>>>>>      As for actually solving the deployment issues at scale (instead
>>>>>      of this horrible hack) I'm looking into adding some robustness at
>>>>>      the ironic or tripleo level to these operations. It sounds like
>>>>>      you're running more into node assignment issues rather than pxe
>>>>>      issues though.
>>>>>
>>>>>      2016-11-03 9:16 GMT-04:00 Luca 'remix_tj' Lorenzetto
>>>>>      <lorenzetto.luca at gmail.com>:
>>>>>
>>>>>          On Wed, Nov 2, 2016 at 8:30 PM, Charles Short <cems at ebi.ac.uk> wrote:
>>>>>          > Some more testing of different amounts of nodes vs time taken
>>>>>          > for successful deployments -
>>>>>          >
>>>>>          > 3 controller 3 compute = 1 hour
>>>>>          > 3 controller 15 compute = 1 hour
>>>>>          > 3 controller 25 compute  = 1 hour 45 mins
>>>>>          > 3 controller 35 compute  = 4 hours
>>>>>
>>>>>          Hello,
>>>>>
>>>>>          I'm now preparing my deployment of 3+2 nodes. I'll check what
>>>>>          you reported and give you some feedback.
>>>>>
>>>>>          Luca
>>>>>
>>>>>
>>>>>          --
>>>>>          "It is absurd to employ men of excellent intelligence to
>>>>>          perform calculations that could be entrusted to anyone if
>>>>>          machines were used"
>>>>>          Gottfried Wilhelm von Leibnitz, Philosopher and Mathematician
>>>>>          (1646-1716)
>>>>>
>>>>>          "The Internet is the largest library in the world.
>>>>>          But the problem is that the books are all scattered on the floor"
>>>>>          John Allen Paulos, Mathematician (1945-living)
>>>>>
>>>>>          Luca 'remix_tj' Lorenzetto, http://www.remixtj.net ,
>>>>>          <lorenzetto.luca at gmail.com>
>>>>>
>>>>>          _______________________________________________
>>>>>          rdo-list mailing list
>>>>>          rdo-list at redhat.com
>>>>>          https://www.redhat.com/mailman/listinfo/rdo-list
>>>>>
>>>>>          To unsubscribe: rdo-list-unsubscribe at redhat.com
>>>>>
>>>>>
>>>>      --
>>>>      Charles Short
>>>>      Cloud Engineer
>>>>      Virtualization and Cloud Team
>>>>      European Bioinformatics Institute (EMBL-EBI)
>>>>      Tel: +44 (0)1223 494205
>>>>
>>>>
>>> -- 
>>> Charles Short
>>> Cloud Engineer
>>> Virtualization and Cloud Team
>>> European Bioinformatics Institute (EMBL-EBI)
>>> Tel: +44 (0)1223 494205
>>>
>>>
>>>
>>>
>>
> 

-- 
Graeme Gillies
Principal Systems Administrator
Openstack Infrastructure
Red Hat Australia