[rdo-list] [TripleO] Newton large baremetal deployment issues

Sun Nov 6 23:25:29 UTC 2016

Hi Charles,

This definitely looks a bit strange to me, as we do deploys of around 42
nodes and they take around 2 hours, on a setup similar to yours (1G
link for provisioning, bonded 10G for everything else).
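
To put rough numbers on that comparison, here is a back-of-the-envelope
calculation using the timings you reported earlier in the thread (quoted
below) next to the "around 42 nodes in around 2 hours" figure. It is only
a sketch: the total node counts assume each of your runs was 3 controllers
plus the stated number of computes, and the names `runs` and
`minutes_per_node` are made up for illustration.

```python
# (label, total nodes, minutes). The first four rows come from the quoted
# thread, ASSUMING total nodes = 3 controllers + N computes. The last row
# is the approximate "42 nodes in 2 hours" reference figure.
runs = [
    ("3 ctrl + 3 compute", 6, 60),
    ("3 ctrl + 15 compute", 18, 60),
    ("3 ctrl + 25 compute", 28, 105),
    ("3 ctrl + 35 compute", 38, 240),
    ("reference, ~42 nodes", 42, 120),
]


def minutes_per_node(nodes, minutes):
    """Average wall-clock minutes spent per deployed node."""
    return minutes / nodes


for label, nodes, minutes in runs:
    rate = minutes_per_node(nodes, minutes)
    print(f"{label:22s} {minutes:4d} min  {rate:5.2f} min/node")
```

If deployment time grew linearly with node count, the per-node figure
would stay roughly flat. In the reported results it climbs between 25 and
35 computes, which suggests something other than node count alone is at
work.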

Would it be possible for you to run an sosreport on your undercloud and
provide it somewhere (if you are comfortable doing so)? Also, can you
show us the output of

openstack stack list --nested

And most importantly, can we get a full copy of the output of the
overcloud deploy command, with timestamps showing when every stack is
created/finished, so we can try and narrow down where all the time is
being spent?
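
Once that output is saved to a file, the per-stack timings can be pulled
out with a short script. The following is only a sketch: it assumes each
event line looks like
"2016-11-04 15:17:01Z [overcloud.Compute]: CREATE_IN_PROGRESS  state changed",
which may not match your client version exactly, and the function name
`stack_durations` is made up for illustration.

```python
import re
from datetime import datetime

# ASSUMED event line format (adjust the pattern if your output differs):
#   2016-11-04 15:17:01Z [overcloud.Compute]: CREATE_IN_PROGRESS  state changed
EVENT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z? "
    r"\[([^\]]+)\]: "
    r"([A-Z]+)_(IN_PROGRESS|COMPLETE|FAILED)\b"
)


def stack_durations(lines):
    """Return (name, seconds) pairs, slowest first, for resources that
    were seen starting and then completing or failing."""
    started = {}
    durations = {}
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # not an event line, skip it
        stamp, name, _action, status = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if status == "IN_PROGRESS":
            # Keep the first start time seen for each resource.
            started.setdefault(name, when)
        elif name in started:
            durations[name] = (when - started[name]).total_seconds()
    return sorted(durations.items(), key=lambda item: item[1], reverse=True)
```

For example, `stack_durations(open("deploy.log"))` (where `deploy.log` is
a hypothetical file holding the captured output) returns the slowest
resources first, which should show quickly where the hours are going.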

You note that you have quite a powerful undercloud (294GB of memory and
64 CPUs), and we have had issues in the past with very powerful
underclouds, because the OpenStack components try to tune themselves
around the hardware they are running on and get it wrong for bigger servers.

Can we also get output from "sar" or whatever other tool you are using
to track CPU and memory usage during the deployment? I'd like to check
those values look sane.
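
If nothing is collecting those numbers yet, something along the following
lines could be left running on the undercloud for the duration of the
deploy. It is a minimal sketch and not a replacement for sar: it assumes
a Linux host with /proc/meminfo, and the function names `parse_meminfo`,
`sample` and `watch` are made up for illustration.

```python
import os
import time


def parse_meminfo(text):
    """Turn /proc/meminfo content into {field: integer value}.
    Most fields are reported in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def sample():
    """Take one reading of load average and memory on the local host."""
    with open("/proc/meminfo") as handle:
        mem = parse_meminfo(handle.read())
    load1, _load5, _load15 = os.getloadavg()
    return {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "load1": load1,
        "cpus": os.cpu_count(),
        "mem_total_kb": mem.get("MemTotal"),
        "mem_available_kb": mem.get("MemAvailable"),
    }


def watch(interval=60, count=None):
    """Print one sample every `interval` seconds. With count=None this
    runs until interrupted."""
    taken = 0
    while count is None or taken < count:
        print(sample())
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
```

Calling `watch()` with its output redirected to a file gives one line per
minute, which can then be lined up against the timestamps in the deploy
output.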

Thanks in advance,

Graeme

On 05/11/16 01:31, Charles Short wrote:
> Hi,
> 
> Each node has 2X HP 900GB 12G SAS 10K 2.5in SC ENT HDD.
> The 1Gb deployment NIC is not really causing the delay. It is very busy
> for the time the overcloud image is rolled out (the first 30 to 45 mins
> of deployment), but after that (once all the nodes are up and active
> with a pingable IP address), the bandwidth is a fraction of 1Gbps on
> average for the rest of the deployment. For info, the NICs in the nodes
> for the Overcloud networks are dual bonded 10Gbit.
> 
> The deployment I mentioned before (50 nodes) actually completed in 8
> hours (which is double the time it took for 35 nodes!)
> 
> I am in the process of a new 3 controller, 59 compute node deployment,
> pinning all the nodes as you suggested. The initial overcloud image roll
> out took just under 1 hour (all nodes ACTIVE and pingable). I am now 4.5
> hours in and all is running (slowly). It is currently on Step 2 (of 5
> steps). I would expect this deployment to take 10 hours at the current speed.
> 
> Regards
> 
> Charles
> 
> On 04/11/2016 15:17, Justin Kilpatrick wrote:
>> Hey Charles,
>>
>> What sort of issues are you seeing now? How did node pinning work out
>> and did a slow scale up present any more problems?
>>
>> Deployments tend to be disk and network limited. You don't mention
>> what sort of disks your machines have, but you do note 1G NICs, which
>> are doable but might require some timeout adjustments or other
>> considerations to give everything time to complete.
>>
>> On Fri, Nov 4, 2016 at 10:45 AM, Charles Short <cems at ebi.ac.uk> wrote:
>>
>>     Hi,
>>
>>     So you are implying that TripleO is not really currently able to
>>     roll out large deployments easily, as it is prone to scaling
>>     delays/errors?
>>     Is the same true for RH OSP9 (out of the box), as this also uses
>>     TripleO? I would expect exactly the same scaling issues. But
>>     surely OSP9 is designed for large enterprise OpenStack installations?
>>     So if OSP9 does work well with large deployments, what are the
>>     TripleO tweaks that make this work (if any)?
>>
>>     Many Thanks
>>
>>     Charles
>>
>>     On 03/11/2016 13:30, Justin Kilpatrick wrote:
>>>     Hey Charles,
>>>
>>>     If you want to deploy a large number of machines, I suggest you
>>>     deploy a small configuration (maybe 3 controllers, 1 compute) and
>>>     then run the overcloud deploy command again with 2 computes, and
>>>     so on until you reach your full allocation.
>>>
>>>     Realistically you can probably do a stride of 5 computes each
>>>     time; experiment with it a bit. As you get up to the full
>>>     allocation of nodes you might run into a race condition bug with
>>>     assigning computes to nodes and need to pin nodes (pinning means
>>>     adding an ironic property saying that overcloud-novacompute-0
>>>     goes here, 1 goes there, and so on).
>>>
>>>     As for actually solving the deployment issues at scale (instead
>>>     of this horrible hack) I'm looking into adding some robustness at
>>>     the ironic or tripleo level to these operations. It sounds like
>>>     you're running more into node assignment issues rather than pxe
>>>     issues though.
>>>
>>>     2016-11-03 9:16 GMT-04:00 Luca 'remix_tj' Lorenzetto
>>>     <lorenzetto.luca at gmail.com>:
>>>
>>>         On Wed, Nov 2, 2016 at 8:30 PM, Charles Short <cems at ebi.ac.uk> wrote:
>>>         > Some more testing of different amounts of nodes vs time taken
>>>         > for successful deployments -
>>>         >
>>>         > 3 controller 3 compute = 1 hour
>>>         > 3 controller 15 compute = 1 hour
>>>         > 3 controller 25 compute  = 1 hour 45 mins
>>>         > 3 controller 35 compute  = 4 hours
>>>
>>>         Hello,
>>>
>>>         I'm now preparing my deployment of 3+2 nodes. I'll check what you
>>>         reported and give you some feedback.
>>>
>>>         Luca
>>>
>>>
>>>         --
>>>         "It is absurd to employ men of excellent intelligence to
>>>         perform calculations that could be entrusted to anyone if
>>>         machines were used"
>>>         Gottfried Wilhelm von Leibnitz, Philosopher and Mathematician (1646-1716)
>>>
>>>         "The Internet is the largest library in the world.
>>>         But the problem is that the books are all scattered on the floor"
>>>         John Allen Paulos, Mathematician (1945-living)
>>>
>>>         Luca 'remix_tj' Lorenzetto, http://www.remixtj.net ,
>>>         <lorenzetto.luca at gmail.com>
>>>
>>>         _______________________________________________
>>>         rdo-list mailing list
>>>         rdo-list at redhat.com <mailto:rdo-list at redhat.com>
>>>         https://www.redhat.com/mailman/listinfo/rdo-list
>>>         <https://www.redhat.com/mailman/listinfo/rdo-list>
>>>
>>>         To unsubscribe: rdo-list-unsubscribe at redhat.com
>>>         <mailto:rdo-list-unsubscribe at redhat.com>
>>>
>>>
>>
>>     -- 
>>     Charles Short
>>     Cloud Engineer
>>     Virtualization and Cloud Team
>>     European Bioinformatics Institute (EMBL-EBI)
>>     Tel: +44 (0)1223 494205
>>
>>
> 
> -- 
> Charles Short
> Cloud Engineer
> Virtualization and Cloud Team
> European Bioinformatics Institute (EMBL-EBI)
> Tel: +44 (0)1223 494205 
> 
> 
> 
> 

-- 
Graeme Gillies
Principal Systems Administrator
Openstack Infrastructure
Red Hat Australia