Some more testing of different numbers of nodes vs. time taken for
successful deployments -
3 controller 3 compute = 1 hour
3 controller 15 compute = 1 hour
3 controller 25 compute = 1 hour 45 mins
3 controller 35 compute = 4 hours
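The marginal cost per added compute node can be worked out from those timings (a quick sketch; the 4-hour figure is treated as a round 240 minutes):

```python
# Reported successful-deployment times (minutes) per compute-node count,
# taken from the figures above (3 controllers in every case).
times = {3: 60, 15: 60, 25: 105, 35: 240}

counts = sorted(times)
# Marginal minutes per additional compute node between consecutive runs.
marginal = {
    (a, b): (times[b] - times[a]) / (b - a)
    for a, b in zip(counts, counts[1:])
}
print(marginal)  # {(3, 15): 0.0, (15, 25): 4.5, (25, 35): 13.5}
```

The marginal cost per node triples between the 25-node and 35-node runs, which points at something super-linear in the configuration phase rather than a fixed per-node cost.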
Charles
On 02/11/2016 09:44, Charles Short wrote:
Hi,
I am running TripleO Newton stable release and am deploying on
baremetal with CentOS.
I have 64 nodes, and the Undercloud has plenty of resources, as it is
one of the nodes, with 294 GB of memory and 64 CPUs.
The provisioning network is 1 Gbps.
I have tried tuning the Undercloud using this tuning section in 10.7
as a guide
https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/p...
My Undercloud passes validations in Clapper
https://github.com/rthallisey/clapper
I am deploying with Network Isolation and 3 Controllers in HA.
If I create a stack with 3 Controllers and 3 compute nodes this takes
about 1 hour
If I create a stack with 3 Controllers and 15 compute nodes this takes
about 1 hour
Both stacks pass Clapper validations.
During deployment I can see that the first 20 to 30 mins use all the
bandwidth available for the overcloud image deployment, and then hardly
any bandwidth is used while the rest of the configuration takes place.
So I try a stack with 40 nodes. This is where I have issues.
I set the timeout to 4 hours and leave it overnight to deploy.
It times out and fails to deploy every time.
During the 40 node deployment the overcloud image is distributed to all
nodes in about 45 mins, and all nodes appear ACTIVE and have an IP
address on the deployment network.
So it would appear that the rest of the low bandwidth configuration is
taking well over 3 hours to complete. This seems excessive.
I have configured nova.conf for deployment concurrency (from the
tuning link above) and configured heat.conf 'num_engine_workers' to be
32, taking into account this bug
https://bugzilla.redhat.com/show_bug.cgi?id=1370516
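For concreteness, these settings live in the undercloud's heat.conf and nova.conf. A sketch of the stanzas involved (the values below are illustrative, not recommendations):

```ini
# /etc/heat/heat.conf (undercloud)
[DEFAULT]
# Number of heat-engine worker processes; 32 here per the bug referenced above.
num_engine_workers = 32

# /etc/nova/nova.conf (undercloud)
[DEFAULT]
# How many nodes will be provisioned in parallel -- the
# "deployment concurrency" knob from the tuning guide.
max_concurrent_builds = 10
```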
So my question is how do I tune my Undercloud to speed up the deployment?
Looking at htop during deployment I can see heat is using many CPUs,
but the work is NOT evenly distributed. What typically happens is that
all the CPUs sit at 0 to 1% usage apart from one, which is at 50 to
100%. The ID of that busy CPU changes regularly, but there is no
concurrent workload spread across all the CPUs that the heat processes
are running on. Is heat really multi-threaded, or does it have
limitations so that it can only really do proper work on one CPU at a
time (which is what I am seeing in htop)?
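For background: heat-engine is written in Python, and within a single worker process CPython's GIL prevents pure-Python threads from executing on more than one CPU at a time; parallelism comes from the separate worker processes (num_engine_workers), and a single stack operation tends to stay on one worker, which would match the one-busy-CPU pattern above. A minimal sketch of the GIL effect (illustrative only, nothing heat-specific):

```python
import threading
import time

def burn(n):
    # Pure-Python CPU-bound loop; under CPython's GIL, threads running
    # this cannot execute bytecode in parallel.
    x = 0
    for _ in range(n):
        x += 1
    return x

N = 5_000_000

# One run of the work, single-threaded.
start = time.perf_counter()
burn(N)
single = time.perf_counter() - start

# The same work in each of two threads: with the GIL they serialize,
# taking roughly twice as long instead of finishing in parallel.
start = time.perf_counter()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"single: {single:.2f}s, two threads: {threaded:.2f}s")
```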
Thanks
Charles
--
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205