[rdo-list] [TripleO] Newton large baremetal deployment issues

Fri Nov 11 18:44:19 UTC 2016

________________________________
From: rdo-list-bounces at redhat.com <rdo-list-bounces at redhat.com> on behalf of Charles Short <cems at ebi.ac.uk>
Sent: Friday, November 11, 2016 7:48 PM
To: rdo-list at redhat.com
Subject: Re: [rdo-list] [TripleO] Newton large baremetal deployment issues

Update -

I updated the Undercloud to the latest stable Newton release and used the
provided CentOS Overcloud images. I first completed a small test
deployment with no problems (3 controllers, 3 compute).
I then deployed again with a larger environment (40 compute, 3 controllers).
When the nodes were up and ACTIVE/pingable early in the deployment I checked
the hosts files. This time there were no formatting errors.
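
For anyone repeating this check across many nodes: the formatting problem reported earlier in this thread was a literal '\n' (a backslash followed by the letter n) at the start of some hosts entries. The following is only a minimal sketch of how that could be detected, assuming the hosts file content has already been fetched from a node as text. The function name and the sample data are illustrative and not part of TripleO.

```python
import re

# Matches a literal backslash followed by "n" (two characters),
# which is different from a real newline character.
LITERAL_NEWLINE = re.compile(r"\\n")


def find_literal_newlines(hosts_text):
    """Return (line_number, line) pairs for lines containing a literal backslash-n."""
    hits = []
    for lineno, line in enumerate(hosts_text.splitlines(), start=1):
        if LITERAL_NEWLINE.search(line):
            hits.append((lineno, line))
    return hits


# Illustrative sample based on the hosts entries quoted in this thread.
SAMPLE = "\n".join([
    r"\n10.0.7.30 overcloud-novacompute-32.localdomain overcloud-novacompute-32",
    "192.168.0.39 overcloud-novacompute-32.external.localdomain overcloud-novacompute-32.external",
    r"\n10.0.7.21 overcloud-novacompute-33.localdomain overcloud-novacompute-33",
])

if __name__ == "__main__":
    for lineno, line in find_literal_newlines(SAMPLE):
        print(f"line {lineno}: {line}")
```

An empty result means no literal '\n' sequences were found in that file. How the file is collected from each node (ssh, ansible, and so on) is left out on purpose.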

However during deployment there were lots of long pauses and I noticed
plenty of these sorts of messages in the nova logs during the pauses -

/var/log/nova/nova-compute.log:2016-11-11 19:56:07.322 7840 ERROR nova.compute.manager [instance: f6fd4127-fc94-4a36-9b3b-5e4f21bd08ed]     raise exceptions.ConnectTimeout(msg)
/var/log/nova/nova-compute.log:2016-11-11 19:56:07.322 7840 ERROR nova.compute.manager [instance: f6fd4127-fc94-4a36-9b3b-5e4f21bd08ed] ConnectTimeout: Request to http://192.168.0.1:9696/v2.0/ports.json?tenant_id=30401f505075414fbd700f028412977f&device_id=f6fd4127-fc94-4a36-9b3b-5e4f21bd08ed timed out
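
To see whether timeouts like these line up with the long pauses, it can help to count them per minute. The following is a rough sketch, not an official tool. It assumes each log entry is on a single unwrapped line (as in the log file itself, not as wrapped by a mail client) and uses the timestamp layout shown in the messages above. The function name is made up for this example.

```python
import re
from collections import Counter

# Timestamp layout as seen in the quoted log lines, e.g. "2016-11-11 19:56:07.322".
# The capture group keeps the date plus hour and minute only.
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+")


def timeouts_per_minute(lines):
    """Count 'ConnectTimeout: Request to' messages, grouped by minute."""
    counts = Counter()
    for line in lines:
        # Skip traceback lines such as "raise exceptions.ConnectTimeout(msg)".
        if "ConnectTimeout: Request to" not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines taken from the messages above, joined onto single lines.
SAMPLE_LINES = [
    "/var/log/nova/nova-compute.log:2016-11-11 19:56:07.322 7840 ERROR "
    "nova.compute.manager [instance: f6fd4127-fc94-4a36-9b3b-5e4f21bd08ed]"
    "     raise exceptions.ConnectTimeout(msg)",
    "/var/log/nova/nova-compute.log:2016-11-11 19:56:07.322 7840 ERROR "
    "nova.compute.manager [instance: f6fd4127-fc94-4a36-9b3b-5e4f21bd08ed] "
    "ConnectTimeout: Request to http://192.168.0.1:9696/v2.0/ports.json timed out",
]

if __name__ == "__main__":
    for minute, count in sorted(timeouts_per_minute(SAMPLE_LINES).items()):
        print(minute, count)
```

In practice the lines would come from reading /var/log/nova/nova-compute.log directly. Minutes with a high count can then be compared against the times when the deployment appeared to stall.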

While this was happening I could not use nova from the Undercloud at all-

source stackrc
nova list
ERROR (ClientException): The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-367c9ac2-6f27-4e71-a451-681c8c3d2ce5)

After 2 hours of deployment and only on Step 1 of 5 the deployment fails
with -

ERROR: Timed out waiting for a reply to message ID 971157a211e549998bb7a6f6e494688b

Note: I have a timeout value well over 2 hours in the deployment
commands (2000).

After the failed deployment I still cannot use nova (same error as above).
It looks like the Undercloud is very unhappy.

The only way I can get the Undercloud working again is to restart all
services (restarting nova alone does not work):
sudo systemctl restart neutron*
sudo systemctl restart openstack*

What would happen if you did the following?
 # openstack stack delete overcloud
   Stack deleted cleanly
 # shutdown -r now

Boris.

I think I may try OSP9 as I am running out of ideas. Either that or
give OpenStack-Ansible a try.

Charles

On 10/11/2016 08:20, Charles Short wrote:
> Hi,
>
> Deploy command here
>
> http://pastebin.com/xNZXTWPE
>
> no output from rpm command.
>
> Yes re OSP9 images I was just interested how they behaved early on in
> the deployment before any puppet errors (cloud init etc).
> Not a good test, just morbid fascination out of desperation.
>
> No Windows involved, and I have not altered the main puppet template
> directory at all.
>
> I am going to try and update the Undercloud to the latest stable, use
> the provided images and see how that goes.
>
> If all else fails I will install OSP9 and consider myself exhausted
> from all the swimming upstream ;)
>
> Charles
>
> On 10/11/2016 03:13, Graeme Gillies wrote:
>> On 10/11/16 02:18, Charles Short wrote:
>>> Hi,
>>>
>>> Just some feedback on this thread.
>>>
>>> I have redeployed several times and have begun to suspect DNS as the
>>> cause of the delays (just a guess, as the deployment always completes
>>> with no obvious errors).
>>> I had a look at the local hosts files on the nodes during deployment and
>>> can see that lots of them (not all) are incorrectly formatted, as they
>>> contain '\n'.
>>>
>>> For example a small part of one hosts file -
>>> <<
>>> \n10.0.7.30 overcloud-novacompute-32.localdomain overcloud-novacompute-32
>>> 192.168.0.39 overcloud-novacompute-32.external.localdomain overcloud-novacompute-32.external
>>> 10.0.7.30 overcloud-novacompute-32.internalapi.localdomain overcloud-novacompute-32.internalapi
>>> 10.35.5.67 overcloud-novacompute-32.storage.localdomain overcloud-novacompute-32.storage
>>> 192.168.0.39 overcloud-novacompute-32.storagemgmt.localdomain overcloud-novacompute-32.storagemgmt
>>> 10.0.8.39 overcloud-novacompute-32.tenant.localdomain overcloud-novacompute-32.tenant
>>> 192.168.0.39 overcloud-novacompute-32.management.localdomain overcloud-novacompute-32.management
>>> 192.168.0.39 overcloud-novacompute-32.ctlplane.localdomain overcloud-novacompute-32.ctlplane
>>> \n10.0.7.21 overcloud-novacompute-33.localdomain overcloud-novacompute-33
>>>
>>> I wondered if maybe the image I was using was the issue, so I tried the
>>> RH OSP9 official image. Same hosts file formatting issues in deployment.
>>> Maybe a workaround would be to change nsswitch.conf in the image to look
>>> up from DNS first (my Undercloud dnsmasq server) and have this populated
>>> with the correct entries from a node (once all nodes are pingable).
>>>
>>> Charles
>> Hi Charles,
>>
>> If you are getting formatting issues in /etc/hosts, it's possible that
>> the templates directory you are using might have problems, especially if
>> it's been edited on Windows machines. Are you using unmodified templates
>> from /usr/share/openstack-tripleo-heat-templates? Also note that RHOS 9
>> images will not match RDO Newton templates, as RHOS 9 is Mitaka, and
>> overcloud images contain puppet modules which must stay in sync with the
>> templates used on the undercloud.
>>
>> If you are using the templates in
>> /usr/share/openstack-tripleo-heat-templates, can you give the output (if
>> any) from
>>
>> rpm -V openstack-tripleo-heat-templates
>>
>> Also perhaps getting a copy of your full overcloud deploy command will
>> help shed some light as well.
>>
>> Thanks in advance,
>>
>> Graeme
>>
>>> On 06/11/2016 23:25, Graeme Gillies wrote:
>>>> Hi Charles,
>>>>
>>>> This definitely looks a bit strange to me, as we do deploys of around 42
>>>> nodes and it takes around 2 hours to do so, with a setup similar to yours
>>>> (1G link for provisioning, bonded 10G for everything else).
>>>>
>>>> Would it be possible for you to run an sosreport on your undercloud and
>>>> provide it somewhere (if you are comfortable doing so)? Also, can you
>>>> show us the output of
>>>>
>>>> openstack stack list --nested
>>>>
>>>> And most importantly, can we get a full copy of the output of the
>>>> overcloud deploy command, with timestamps for when every stack is
>>>> created/finished, so we can try to narrow down where all the time is
>>>> being spent?
>>>>
>>>> You note that you have quite a powerful undercloud (294GB of memory and
>>>> 64 CPUs), and we have had issues in the past with very powerful
>>>> underclouds, because the OpenStack components try to tune themselves
>>>> around the hardware they are running on and get it wrong for bigger
>>>> servers.
>>>>
>>>> Are we able to get output from "sar" or some other tool you are using
>>>> to track CPU and memory usage during the deployment? I'd like to check
>>>> that those values look sane.
>>>>
>>>> Thanks in advance,
>>>>
>>>> Graeme
>>>>
>>>> On 05/11/16 01:31, Charles Short wrote:
>>>>> Hi,
>>>>>
>>>>> Each node has 2X HP 900GB 12G SAS 10K 2.5in SC ENT HDD.
>>>>> The 1Gb deployment NIC is not really causing the delay. It is very busy
>>>>> while the overcloud image is rolled out (the first 30 to 45 mins of
>>>>> deployment), but after that (once all the nodes are up and active with
>>>>> a pingable IP address), the bandwidth is a fraction of 1Gbps on average
>>>>> for the rest of the deployment. For info, the NICs in the nodes for the
>>>>> Overcloud networks are dual bonded 10Gbit.
>>>>>
>>>>> The deployment I mentioned before (50 nodes) actually completed in 8
>>>>> hours (which is double the time it took for 35 nodes!)
>>>>>
>>>>> I am in the process of a new 3 controller, 59 compute node deployment,
>>>>> pinning all the nodes as you suggested. The initial overcloud image
>>>>> roll out took just under 1 hour (all nodes ACTIVE and pingable). I am
>>>>> now 45 hours in and all is running (slowly). It is currently on Step 2
>>>>> (of 5 Steps). I would expect this deployment to take 10 hours at the
>>>>> current speed.
>>>>>
>>>>> Regards
>>>>>
>>>>> Charles
>>>>>
>>>>> On 04/11/2016 15:17, Justin Kilpatrick wrote:
>>>>>> Hey Charles,
>>>>>>
>>>>>> What sort of issues are you seeing now? How did node pinning work out,
>>>>>> and did a slow scale up present any more problems?
>>>>>>
>>>>>> Deployments tend to be disk and network limited. You don't mention
>>>>>> what sort of disks your machines have, but you do note 1G NICs, which
>>>>>> are doable but might require some timeout adjustments or other
>>>>>> considerations to give everything time to complete.
>>>>>>
>>>>>> On Fri, Nov 4, 2016 at 10:45 AM, Charles Short <cems at ebi.ac.uk> wrote:
>>>>>>
>>>>>>       Hi,
>>>>>>
>>>>>>       So you are implying that TripleO is not really currently able to
>>>>>>       roll out large deployments easily, as it is prone to scaling
>>>>>>       delays/errors?
>>>>>>       Is the same true for RH OSP9 (out of the box), as this also uses
>>>>>>       TripleO? I would expect exactly the same scaling issues. But
>>>>>>       surely OSP9 is designed for large enterprise OpenStack
>>>>>>       installations?
>>>>>>       So if OSP9 does work well with large deployments, what are the
>>>>>>       TripleO tweaks that make this work (if any)?
>>>>>>
>>>>>>       Many Thanks
>>>>>>
>>>>>>       Charles
>>>>>>
>>>>>>       On 03/11/2016 13:30, Justin Kilpatrick wrote:
>>>>>>>       Hey Charles,
>>>>>>>
>>>>>>>       If you want to deploy a large number of machines, I suggest you
>>>>>>>       deploy a small configuration (maybe 3 controllers, 1 compute) and
>>>>>>>       then run the overcloud deploy command again with 2 computes, and
>>>>>>>       so on until you reach your full allocation.
>>>>>>>
>>>>>>>       Realistically you can probably do a stride of 5 computes each
>>>>>>>       time; experiment with it a bit. As you get up to the full
>>>>>>>       allocation of nodes you might run into a race condition bug with
>>>>>>>       assigning computes to nodes and need to pin nodes (pinning is
>>>>>>>       adding an ironic property saying that overcloud-novacompute-0
>>>>>>>       goes here, 1 here, and so on).
>>>>>>>
>>>>>>>       As for actually solving the deployment issues at scale (instead
>>>>>>>       of this horrible hack), I'm looking into adding some robustness at
>>>>>>>       the ironic or tripleo level to these operations. It sounds like
>>>>>>>       you're running more into node assignment issues rather than PXE
>>>>>>>       issues though.
>>>>>>>
>>>>>>>       2016-11-03 9:16 GMT-04:00 Luca 'remix_tj' Lorenzetto
>>>>>>>       <lorenzetto.luca at gmail.com>:
>>>>>>>
>>>>>>>           On Wed, Nov 2, 2016 at 8:30 PM, Charles Short
>>>>>>>           <cems at ebi.ac.uk> wrote:
>>>>>>>           > Some more testing of different amounts of nodes vs time
>>>>>>>           > taken for successful deployments -
>>>>>>>           >
>>>>>>>           > 3 controller 3 compute = 1 hour
>>>>>>>           > 3 controller 15 compute = 1 hour
>>>>>>>           > 3 controller 25 compute  = 1 hour 45 mins
>>>>>>>           > 3 controller 35 compute  = 4 hours
>>>>>>>
>>>>>>>           Hello,
>>>>>>>
>>>>>>>           I'm now preparing my deployment of 3+2 nodes. I'll check
>>>>>>>           what you reported and give you some feedback.
>>>>>>>
>>>>>>>           Luca
>>>>>>>
>>>>>>>
>>>>>>>           --
>>>>>>>           "It is absurd to employ men of excellent intelligence
>>>>>>>           to perform calculations that could be entrusted to anyone
>>>>>>>           if machines were used"
>>>>>>>           Gottfried Wilhelm von Leibnitz, Philosopher and Mathematician
>>>>>>>           (1646-1716)
>>>>>>>
>>>>>>>           "The Internet is the largest library in the world.
>>>>>>>           But the problem is that the books are all scattered on the
>>>>>>>           floor"
>>>>>>>           John Allen Paulos, Mathematician (1945-living)
>>>>>>>
>>>>>>>           Luca 'remix_tj' Lorenzetto, http://www.remixtj.net ,
>>>>>>>           <lorenzetto.luca at gmail.com>
>>>>>>>
>>>>>>>           _______________________________________________
>>>>>>>           rdo-list mailing list
>>>>>>>           rdo-list at redhat.com
>>>>>>>           https://www.redhat.com/mailman/listinfo/rdo-list
>>>>>>>
>>>>>>>           To unsubscribe: rdo-list-unsubscribe at redhat.com
>>>>>>>
>>>>>>>
>>>>>>       --
>>>>>>       Charles Short
>>>>>>       Cloud Engineer
>>>>>>       Virtualization and Cloud Team
>>>>>>       European Bioinformatics Institute (EMBL-EBI)
>>>>>>       Tel: +44 (0)1223 494205
>>>>>>
>>>>>>
>>>>> --
>>>>> Charles Short
>>>>> Cloud Engineer
>>>>> Virtualization and Cloud Team
>>>>> European Bioinformatics Institute (EMBL-EBI)
>>>>> Tel: +44 (0)1223 494205
>>>>>
>>
>

--
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205
