[rdo-list] [TripleO] Newton large baremetal deployment issues

Wed Nov 9 16:18:01 UTC 2016

Hi,

Just some feedback on this thread.

I have redeployed several times and have begun to suspect DNS as the 
cause of the delays (just a guess, as the deployment always completes 
with no obvious errors).
I had a look at the local hosts files on the nodes during deployment and 
can see that lots of them (not all) are incorrectly formatted: some 
entries start with a literal '\n'.

For example a small part of one hosts file -
<<
\n10.0.7.30 overcloud-novacompute-32.localdomain overcloud-novacompute-32
192.168.0.39 overcloud-novacompute-32.external.localdomain overcloud-novacompute-32.external
10.0.7.30 overcloud-novacompute-32.internalapi.localdomain overcloud-novacompute-32.internalapi
10.35.5.67 overcloud-novacompute-32.storage.localdomain overcloud-novacompute-32.storage
192.168.0.39 overcloud-novacompute-32.storagemgmt.localdomain overcloud-novacompute-32.storagemgmt
10.0.8.39 overcloud-novacompute-32.tenant.localdomain overcloud-novacompute-32.tenant
192.168.0.39 overcloud-novacompute-32.management.localdomain overcloud-novacompute-32.management
192.168.0.39 overcloud-novacompute-32.ctlplane.localdomain overcloud-novacompute-32.ctlplane
\n10.0.7.21 overcloud-novacompute-33.localdomain overcloud-novacompute-33
 >>
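
To check a hosts file for this kind of damage quickly, something like the 
following could be used. It is only a sketch: find_bad_hosts_lines is a 
made-up helper name, and it assumes each entry sits on a single line in the 
real file (any wrapping in the excerpt above comes from the mail client). 
A line is flagged when its first field is not a valid IP address (which is 
what a leading literal '\n' causes) or when no hostname follows the address.

```python
import ipaddress


def find_bad_hosts_lines(text):
    """Return (line_number, line) pairs for lines that are not valid hosts entries."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comments are legal in a hosts file.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        try:
            # A literal backslash-n glued to the address makes this fail.
            ipaddress.ip_address(fields[0])
        except ValueError:
            bad.append((number, line))
            continue
        # An address with no hostname after it is also malformed.
        if len(fields) < 2:
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    # Hypothetical usage: scan the local hosts file on a node.
    with open("/etc/hosts") as handle:
        for number, line in find_bad_hosts_lines(handle.read()):
            print(number, repr(line))
```

Running it on each node during deployment would show which nodes have the 
broken entries and on which lines.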

I wondered whether the image I was using was the issue, so I tried the 
official RH OSP9 image: same hosts file formatting issues during deployment.
Maybe a workaround would be to change nsswitch.conf in the image to look 
up from DNS first (my Undercloud dnsmasq server) and have this 
populated with the correct entries from a node (once all nodes are 
pingable).
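
A rough sketch of the nsswitch.conf half of that workaround, assuming the 
image ships a "hosts:" line along the lines of "files dns myhostname" (an 
assumption, not something confirmed for this image). prefer_dns is a 
hypothetical helper name. It moves "dns" to the front of the hosts line and 
leaves the line alone if it contains bracketed actions such as 
[NOTFOUND=return], since those bind to the source before them and 
reordering could change their meaning.

```python
def prefer_dns(nsswitch_text):
    """Return nsswitch.conf text with 'dns' first on the hosts line."""
    out = []
    for line in nsswitch_text.splitlines():
        # Separate the active part of the line from any trailing comment.
        body, sep, comment = line.partition("#")
        if body.strip().startswith("hosts:"):
            sources = body.split(":", 1)[1].split()
            # Bracketed actions attach to the preceding source; do not reorder.
            if not any(token.startswith("[") for token in sources):
                if "dns" in sources:
                    sources.remove("dns")
                new_body = "hosts:      " + " ".join(["dns"] + sources)
                line = new_body + (" " + sep + comment if sep else "")
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "passwd:     files sss\nhosts:      files dns myhostname\n"
    print(prefer_dns(sample), end="")
```

The dnsmasq side (feeding it the correct entries) is not covered here and 
would still need to be worked out separately.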

Charles

On 06/11/2016 23:25, Graeme Gillies wrote:
> Hi Charles,
>
> This definitely looks a bit strange to me, as we do deploys around 42
> nodes and it takes around 2 hours to do so, similar to your setup (1G
> link for provisioning, bonded 10G for everything else).
>
> Would it be possible for you to run an sosreport on your undercloud and
> provide it somewhere (if you are comfortable doing so)? Also, can you
> show us the output of
>
> openstack stack list --nested
>
> And most importantly, if we can get a full copy of the output of the
> overcloud deploy command, with timestamps showing when every stack is
> created/finished, we can try to narrow down where all the time is
> being spent.
>
> You note that you have quite a powerful undercloud (294GB of Memory and
> 64 cpus), and we have had issues in the past with very powerful
> underclouds, because the OpenStack components try to tune themselves
> around the hardware they are running on and get it wrong for bigger servers.
>
> Are we able to get an output from "sar" or some other tool you are using
> to track cpu and memory usage during the deployment? I'd like to check
> those values look sane.
>
> Thanks in advance,
>
> Graeme
>
> On 05/11/16 01:31, Charles Short wrote:
>> Hi,
>>
>> Each node has 2X HP 900GB 12G SAS 10K 2.5in SC ENT HDD.
>> The 1Gb deployment NIC is not really causing the delay. It is very busy
>> for the time the overcloud image is rolled out (the first 30 to 45 mins
>> of deployment), but after that (once all the nodes are up and active
>> with an IP address (pingable)), the bandwidth is a fraction of 1Gbps on
>> average for the rest of the deployment. For info, the NICs in the nodes
>> for the Overcloud networks are dual bonded 10Gbit.
>>
>> The deployment I mentioned before (50 nodes) actually completed in 8
>> hours (which is double the time it took for 35 nodes!)
>>
>> I am in the process of a new 3 controller 59 compute node deployment,
>> pinning all the nodes as you suggested. The initial overcloud image roll
>> out took just under 1 hour (all nodes ACTIVE and pingable). I am now 4.5
>> hours in and all is running (slowly). It is currently on Step 2 (of 5
>> Steps). I would expect this deployment to take 10 hours at the current speed.
>>
>> Regards
>>
>> Charles
>>
>> On 04/11/2016 15:17, Justin Kilpatrick wrote:
>>> Hey Charles,
>>>
>>> What sort of issues are you seeing now? How did node pinning work out
>>> and did a slow scale up present any more problems?
>>>
>>> Deployments tend to be disk and network limited. You don't mention
>>> what sort of disks your machines have, but you do note 1G NICs, which
>>> are doable but might require some timeout adjustments or other
>>> considerations to give everything time to complete.
>>>
>>> On Fri, Nov 4, 2016 at 10:45 AM, Charles Short <cems at ebi.ac.uk> wrote:
>>>
>>>      Hi,
>>>
>>>      So you are implying that TripleO is not currently able to
>>>      roll out large deployments easily, as it is prone to scaling
>>>      delays/errors?
>>>      Is the same true for RH OSP9 (out of the box), as this also uses
>>>      TripleO? I would expect exactly the same scaling issues. But
>>>      surely OSP9 is designed for large enterprise OpenStack installations?
>>>      So if OSP9 does work well with large deployments, what are the
>>>      TripleO tweaks that make this work (if any)?
>>>
>>>      Many Thanks
>>>
>>>      Charles
>>>
>>>      On 03/11/2016 13:30, Justin Kilpatrick wrote:
>>>>      Hey Charles,
>>>>
>>>>      If you want to deploy a large number of machines, I suggest you
>>>>      deploy a small configuration (maybe 3 controllers 1 compute) and
>>>>      then run the overcloud deploy command again with 2 computes, and
>>>>      so on until you reach your full allocation.
>>>>
>>>>      Realistically you can probably do a stride of 5 computes each
>>>>      time; experiment with it a bit. As you get up to the full
>>>>      allocation of nodes you might run into a race condition bug with
>>>>      assigning computes to nodes and need to pin nodes (pinning is
>>>>      adding an ironic property saying that overcloud-novacompute-0 goes
>>>>      here, 1 goes there, and so on).
>>>>
>>>>      As for actually solving the deployment issues at scale (instead
>>>>      of this horrible hack), I'm looking into adding some robustness at
>>>>      the ironic or TripleO level to these operations. It sounds like
>>>>      you're running into node assignment issues rather than PXE
>>>>      issues, though.
>>>>
>>>>      2016-11-03 9:16 GMT-04:00 Luca 'remix_tj' Lorenzetto
>>>>      <lorenzetto.luca at gmail.com>:
>>>>
>>>>          On Wed, Nov 2, 2016 at 8:30 PM, Charles Short <cems at ebi.ac.uk> wrote:
>>>>          > Some more testing of different amounts of nodes vs time taken
>>>>          > for successful deployments -
>>>>          >
>>>>          > 3 controller 3 compute = 1 hour
>>>>          > 3 controller 15 compute = 1 hour
>>>>          > 3 controller 25 compute  = 1 hour 45 mins
>>>>          > 3 controller 35 compute  = 4 hours
>>>>
>>>>          Hello,
>>>>
>>>>          I'm now preparing my deployment of 3+2 nodes. I'll check what you
>>>>          reported and give you some feedback.
>>>>
>>>>          Luca
>>>>
>>>>
>>>>          --
>>>>          "It is absurd to employ men of excellent intelligence to
>>>>          perform calculations that could be entrusted to anyone if
>>>>          machines were used"
>>>>          Gottfried Wilhelm von Leibnitz, Philosopher and Mathematician (1646-1716)
>>>>
>>>>          "The Internet is the largest library in the world.
>>>>          But the problem is that the books are all scattered on the floor"
>>>>          John Allen Paulos, Mathematician (1945-living)
>>>>
>>>>          Luca 'remix_tj' Lorenzetto, http://www.remixtj.net ,
>>>>          <lorenzetto.luca at gmail.com>
>>>>
>>>>          _______________________________________________
>>>>          rdo-list mailing list
>>>>          rdo-list at redhat.com
>>>>          https://www.redhat.com/mailman/listinfo/rdo-list
>>>>
>>>>          To unsubscribe: rdo-list-unsubscribe at redhat.com
>>>>
>>>>
>>>      --
>>>      Charles Short
>>>      Cloud Engineer
>>>      Virtualization and Cloud Team
>>>      European Bioinformatics Institute (EMBL-EBI)
>>>      Tel: +44 (0)1223 494205
>>>
>>>
>> -- 
>> Charles Short
>> Cloud Engineer
>> Virtualization and Cloud Team
>> European Bioinformatics Institute (EMBL-EBI)
>> Tel: +44 (0)1223 494205
>>
>>
>

-- 
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205