[rdo-list] [TripleO] Newton large baremetal deployment issues

Charles Short cems at ebi.ac.uk
Thu Nov 10 08:20:27 UTC 2016


Hi,

The deploy command is here:

http://pastebin.com/xNZXTWPE

There was no output from the rpm command, so the templates are unmodified.

Yes, re the OSP9 images: I was just interested in how they behaved early 
on in the deployment, before any puppet errors (cloud-init etc).
Not a good test, just morbid fascination out of desperation.

No Windows involved, and I have not altered the main puppet template 
directory at all.

I am going to try and update the Undercloud to the latest stable, use 
the provided images and see how that goes.

If all else fails I will install OSP9 and consider myself exhausted from 
all the swimming upstream ;)

Charles

On 10/11/2016 03:13, Graeme Gillies wrote:
> On 10/11/16 02:18, Charles Short wrote:
>> Hi,
>>
>> Just some feedback on this thread.
>>
>> I have redeployed several times and have begun to suspect DNS as the
>> cause of the delays (just a guess, as the deployment always completes
>> with no obvious errors).
>> I had a look at the local hosts files on the nodes during deployment and
>> can see that lots of them (not all) are incorrectly formatted, as they
>> contain literal '\n' sequences.
>>
>> For example a small part of one hosts file -
>> <<
>> \n10.0.7.30 overcloud-novacompute-32.localdomain overcloud-novacompute-32
>> 192.168.0.39 overcloud-novacompute-32.external.localdomain
>> overcloud-novacompute-32.external
>> 10.0.7.30 overcloud-novacompute-32.internalapi.localdomain
>> overcloud-novacompute-32.internalapi
>> 10.35.5.67 overcloud-novacompute-32.storage.localdomain
>> overcloud-novacompute-32.storage
>> 192.168.0.39 overcloud-novacompute-32.storagemgmt.localdomain
>> overcloud-novacompute-32.storagemgmt
>> 10.0.8.39 overcloud-novacompute-32.tenant.localdomain
>> overcloud-novacompute-32.tenant
>> 192.168.0.39 overcloud-novacompute-32.management.localdomain
>> overcloud-novacompute-32.management
>> 192.168.0.39 overcloud-novacompute-32.ctlplane.localdomain
>> overcloud-novacompute-32.ctlplane
>> \n10.0.7.21 overcloud-novacompute-33.localdomain overcloud-novacompute-33
>> I wondered if maybe the image I was using was the issue, so I tried the
>> official RH OSP9 image - same hosts file formatting issues in deployment.
>> Maybe a workaround would be to change nsswitch.conf in the image to look
>> up from DNS first  -  my Undercloud dnsmasq server - and have this
>> populated with the correct entries from a node (once all nodes are
>> pingable).
>>
>> Charles
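A quick way to confirm that symptom on a node is to search the rendered file for the literal two-character sequence backslash + n (rather than a real newline). A minimal sketch, using a temp file seeded with one of the lines from the excerpt above instead of a live /etc/hosts:

```shell
#!/bin/sh
# Write one line containing a literal "\n" (printf %s does not expand it),
# reproducing the mis-rendered hosts entry shown in the excerpt.
tmp=$(mktemp)
printf '%s\n' '\n10.0.7.30 overcloud-novacompute-32.localdomain' > "$tmp"

# grep -F searches for the fixed string backslash + n; on a real node,
# point it at /etc/hosts instead of "$tmp".
if grep -qF '\n' "$tmp"; then
    result="literal-backslash-n-found"
else
    result="clean"
fi
echo "$result"
rm -f "$tmp"
```

The nsswitch.conf workaround described above would amount to changing the "hosts:" line in the image's /etc/nsswitch.conf from the usual "files dns" to "dns files", so the undercloud dnsmasq is consulted before the broken local file.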
> Hi Charles,
>
> If you are getting formatting issues in /etc/hosts, it's possible that
> the templates directory you are using might have problems, especially if
> it's been edited on Windows machines. Are you using unmodified templates
> from /usr/share/openstack-tripleo-heat-templates? Also note that RHOS 9
> images will not match RDO Newton templates, as RHOS 9 is Mitaka, and
> overcloud images contain puppet modules which must stay in sync with the
> templates used on the undercloud.
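If Windows-edited templates are the suspicion, DOS (CRLF) line endings are easy to check for, since a carriage return is a fixed one-character string grep can search. A small self-contained sketch (the file names are made up; on a real undercloud you would point grep at the templates tree):

```shell
#!/bin/sh
# Create one clean file and one with DOS line endings to show what the
# check reports.
tdir=$(mktemp -d)
printf 'unix line\n'  > "$tdir/clean.yaml"
printf 'dos line\r\n' > "$tdir/edited-on-windows.yaml"

# grep -rlF lists every file under the tree containing a carriage return;
# substitute /usr/share/openstack-tripleo-heat-templates for "$tdir".
crlf_files=$(grep -rlF "$(printf '\r')" "$tdir")
echo "$crlf_files"
rm -rf "$tdir"
```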
>
> If you are using the templates in
> /usr/share/openstack-tripleo-heat-templates, can you give the output (if
> any) from
>
> rpm -V openstack-tripleo-heat-templates
>
> Also perhaps getting a copy of your full overcloud deploy command will
> help shed some light as well.
>
> Thanks in advance,
>
> Graeme
>
>> On 06/11/2016 23:25, Graeme Gillies wrote:
>>> Hi Charles,
>>>
>>> This definitely looks a bit strange to me, as we do deploys of around
>>> 42 nodes and they take around 2 hours, similar to your setup (1G
>>> link for provisioning, bonded 10G for everything else).
>>>
>>> Would it be possible for you to run an sosreport on your undercloud and
>>> provide it somewhere (if you are comfortable doing so). Also, can you
>>> show us the output of
>>>
>>> openstack stack list --nested
>>>
>>> And most importantly, if we can get a full copy of the output of the
>>> overcloud deploy command, one that has timestamps against when every
>>> stack is created/finished, we can try and narrow down where all the
>>> time is being spent.
>>>
>>> You note that you have quite a powerful undercloud (294GB of memory and
>>> 64 CPUs), and we have had issues in the past with very powerful
>>> underclouds, because the OpenStack components try to tune themselves
>>> around the hardware they are running on and get it wrong for bigger
>>> servers.
>>>
>>> Are we able to get output from "sar" or some other tool you are using
>>> to track CPU and memory usage during the deployment? I'd like to check
>>> those values look sane.
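One way to capture that for the whole deploy is to leave sar sampling in the background on the undercloud (this needs the sysstat package installed; the interval, sample count and output path below are just examples):

```shell
# Sample CPU (-u) and memory (-r) every 60 seconds, 480 times (~8 hours),
# writing plain text we can read back once the deploy finishes.
sar -u -r 60 480 > /tmp/undercloud-usage.log 2>&1 &
echo "sar running as PID $!"
```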
>>>
>>> Thanks in advance,
>>>
>>> Graeme
>>>
>>> On 05/11/16 01:31, Charles Short wrote:
>>>> Hi,
>>>>
>>>> Each node has 2X HP 900GB 12G SAS 10K 2.5in SC ENT HDD.
>>>> The 1Gb deployment NIC is not really causing the delay. It is very busy
>>>> for the time the overcloud image is rolled out (the first 30 to 45 mins
>>>> of deployment), but after that (once all the nodes are up and active
>>>> with an IP address (pingable)), the bandwidth is a fraction of 1Gbps on
>>>> average for the rest of the deployment. For info, the NICs in the nodes
>>>> for the Overcloud networks are dual bonded 10Gbit.
>>>>
>>>> The deployment I mentioned before (50 nodes) actually completed in 8
>>>> hours (which is double the time it took for 35 nodes!)
>>>>
>>>> I am in the process of a new 3 controller 59 compute node deployment,
>>>> pinning all the nodes as you suggested. The initial overcloud image roll
>>>> out took just under 1 hour (all nodes ACTIVE and pingable). I am now 4.5
>>>> hours in and all is running (slowly). It is currently on Step 2 (of 5
>>>> steps). I would expect this deployment to take 10 hours at the current
>>>> speed.
>>>>
>>>> Regards
>>>>
>>>> Charles
>>>>
>>>> On 04/11/2016 15:17, Justin Kilpatrick wrote:
>>>>> Hey Charles,
>>>>>
>>>>> What sort of issues are you seeing now? How did node pinning work out
>>>>> and did a slow scale up present any more problems?
>>>>>
>>>>> Deployments tend to be disk and network limited; you don't mention
>>>>> what sort of disks your machines have, but you do note 1G NICs, which
>>>>> are doable but might require some timeout adjustments or other
>>>>> considerations to give everything time to complete.
>>>>>
>>>>> On Fri, Nov 4, 2016 at 10:45 AM, Charles Short <cems at ebi.ac.uk> wrote:
>>>>>
>>>>>       Hi,
>>>>>
>>>>>       So you are implying that TripleO is not really currently able to
>>>>>       roll out large deployments easily, as it is prone to scaling
>>>>>       delays/errors?
>>>>>       Is the same true for RH OSP9 (out of the box), as this also uses
>>>>>       TripleO? I would expect exactly the same scaling issues. But
>>>>>       surely OSP9 is designed for large enterprise OpenStack
>>>>>       installations?
>>>>>       So if OSP9 does work well with large deployments, what are the
>>>>>       TripleO tweaks that make this work (if any)?
>>>>>
>>>>>       Many Thanks
>>>>>
>>>>>       Charles
>>>>>
>>>>>       On 03/11/2016 13:30, Justin Kilpatrick wrote:
>>>>>>       Hey Charles,
>>>>>>
>>>>>>       If you want to deploy a large number of machines, I suggest you
>>>>>>       deploy a small configuration (maybe 3 controllers 1 compute) and
>>>>>>       then run the overcloud deploy command again with 2 computes, and
>>>>>>       so on until you reach your full allocation.
>>>>>>
>>>>>>       Realistically you can probably do a stride of 5 computes each
>>>>>>       time; experiment with it a bit. As you get up to the full
>>>>>>       allocation of nodes you might run into a race condition bug with
>>>>>>       assigning computes to nodes and need to pin nodes (pinning means
>>>>>>       adding an ironic property so that overcloud-novacompute-0 goes
>>>>>>       here, 1 there, and so on).
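The pinning mentioned above was done in Newton-era TripleO by tagging each ironic node with a capability and matching it with scheduler hints; a hedged sketch (the node UUID is a placeholder, and the hint shown is for the compute role):

```shell
# Tag a registered ironic node so the scheduler will only place
# overcloud-novacompute-0 on it.
ironic node-update <node-uuid> replace \
    properties/capabilities='node:compute-0,boot_option:local'
```

The matching side is an environment file passed to the deploy command that sets NovaComputeSchedulerHints to {'capabilities:node': 'compute-%index%'}.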
>>>>>>
>>>>>>       As for actually solving the deployment issues at scale (instead
>>>>>>       of this horrible hack), I'm looking into adding some robustness
>>>>>>       at the ironic or TripleO level to these operations. It sounds
>>>>>>       like you're running more into node assignment issues than PXE
>>>>>>       issues, though.
>>>>>>
>>>>>>       2016-11-03 9:16 GMT-04:00 Luca 'remix_tj' Lorenzetto
>>>>>>       <lorenzetto.luca at gmail.com>:
>>>>>>
>>>>>>           On Wed, Nov 2, 2016 at 8:30 PM, Charles Short
>>>>>>           <cems at ebi.ac.uk> wrote:
>>>>>>           > Some more testing of different amounts of nodes vs time
>>>>>>           taken for successful
>>>>>>           > deployments -
>>>>>>           >
>>>>>>           > 3 controller 3 compute = 1 hour
>>>>>>           > 3 controller 15 compute = 1 hour
>>>>>>           > 3 controller 25 compute  = 1 hour 45 mins
>>>>>>           > 3 controller 35 compute  = 4 hours
>>>>>>
>>>>>>           Hello,
>>>>>>
>>>>>>           I'm now preparing my deployment of 3+2 nodes. I'll check
>>>>>>           what you reported and give you some feedback.
>>>>>>
>>>>>>           Luca
>>>>>>
>>>>>>
>>>>>>           --
>>>>>>           "It is absurd to employ men of excellent intelligence
>>>>>>           to do calculations that could be entrusted to anyone
>>>>>>           if machines were used"
>>>>>>           Gottfried Wilhelm von Leibnitz, Philosopher and
>>>>>>           Mathematician (1646-1716)
>>>>>>
>>>>>>           "The Internet is the biggest library in the world.
>>>>>>           But the problem is that the books are all scattered
>>>>>>           on the floor"
>>>>>>           John Allen Paulos, Mathematician (1945-living)
>>>>>>
>>>>>>           Luca 'remix_tj' Lorenzetto, http://www.remixtj.net ,
>>>>>>           <lorenzetto.luca at gmail.com>
>>>>>>
>>>>>>           _______________________________________________
>>>>>>           rdo-list mailing list
>>>>>>           rdo-list at redhat.com
>>>>>>           https://www.redhat.com/mailman/listinfo/rdo-list
>>>>>>
>>>>>>           To unsubscribe: rdo-list-unsubscribe at redhat.com
>>>>>>
>>>>>>
>>>>>       --
>>>>>       Charles Short
>>>>>       Cloud Engineer
>>>>>       Virtualization and Cloud Team
>>>>>       European Bioinformatics Institute (EMBL-EBI)
>>>>>       Tel: +44 (0)1223 494205
>>>>>
>>>>>
>>>> -- 
>>>> Charles Short
>>>> Cloud Engineer
>>>> Virtualization and Cloud Team
>>>> European Bioinformatics Institute (EMBL-EBI)
>>>> Tel: +44 (0)1223 494205
>>>>
>>>>
>>>>
>

-- 
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205



