[rdo-list] [TripleO] Newton large baremetal deployment issues

Wed Nov 2 09:44:05 UTC 2016

Hi,

I am running the TripleO Newton stable release and am deploying on
baremetal with CentOS.
I have 64 nodes, and the Undercloud has plenty of resources, as it is one
of the nodes, with 294 GB of memory and 64 CPUs.
The provisioning network is 1 Gbps.

I have tried tuning the Undercloud, using the tuning section (10.7) of
this guide:

https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/director-installation-and-usage/chapter-10-troubleshooting-director-issues

My Undercloud passes validations in Clapper

https://github.com/rthallisey/clapper

I am deploying with Network Isolation and 3 Controllers in HA.

If I create a stack with 3 Controllers and 3 compute nodes, this takes
about 1 hour.
If I create a stack with 3 Controllers and 15 compute nodes, this also
takes about 1 hour.
Both stacks pass Clapper validations.

During deployment I can see that the first 20 to 30 mins use all the
available bandwidth for the overcloud image deployment, and the
deployment then uses hardly any bandwidth whilst the rest of the
configuration takes place.

So I try a stack with 40 nodes. This is where I have issues.
I set the timeout to 4 hours and leave it overnight to deploy.
Every time, the deployment appears to fail because it hits the timeout.

During the 40 node deployment the overcloud image is distributed to all
nodes in about 45 mins, and all the nodes appear ACTIVE and have an IP
address on the deployment network.
So it would appear that the rest of the low-bandwidth configuration is
taking well over 3 hours to complete. This seems excessive.
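As a rough sanity check on those figures, the arithmetic can be sketched as below. It assumes the 1 Gbps link is fully saturated for the whole 45 minutes, ignores protocol overhead, and uses decimal gigabytes; the function and variable names are only illustrative.

```python
def max_bytes_moved(link_bits_per_second, seconds):
    """Upper bound on bytes a saturated link can carry in the given time."""
    return link_bits_per_second / 8 * seconds


LINK_BPS = 1e9             # 1 Gbps provisioning network
IMAGE_PHASE_S = 45 * 60    # image distribution took about 45 mins
NODES = 40                 # nodes in the failing stack
TIMEOUT_S = 4 * 3600       # stack timeout of 4 hours

# Most data the link could have carried during the image phase.
total_gb = max_bytes_moved(LINK_BPS, IMAGE_PHASE_S) / 1e9
# The same ceiling shared evenly across all nodes.
per_node_gb = total_gb / NODES
# Time left inside the timeout for the low-bandwidth configuration phase.
config_hours = (TIMEOUT_S - IMAGE_PHASE_S) / 3600

print(f"link ceiling: {total_gb:.1f} GB total, {per_node_gb:.2f} GB per node")
print(f"time left for configuration: {config_hours:.2f} hours")
```

Under these assumptions the image phase is bounded by the link, while the remaining 3.25 hours are spent on work that barely touches the network.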
I have configured nova.conf for deployment concurrency (from the tuning
link above) and set 'num_engine_workers' in heat.conf to 32, taking into
account this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1370516
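To confirm what the Undercloud services will actually read, the two files can be checked with a short script. This is a minimal sketch: the paths /etc/heat/heat.conf and /etc/nova/nova.conf are the usual locations, 'max_concurrent_builds' is only my assumption for the nova.conf concurrency option in question, and read_option is a hypothetical helper.

```python
import configparser


def read_option(conf_text, option, section="DEFAULT"):
    """Return the value of `option` from INI-style text, or None if unset."""
    # strict=False tolerates duplicate sections/options in the file.
    parser = configparser.RawConfigParser(strict=False)
    parser.read_string(conf_text)
    return parser.get(section, option, fallback=None)


# (path, option) pairs to report; the nova option name is an assumption.
CHECKS = [
    ("/etc/heat/heat.conf", "num_engine_workers"),
    ("/etc/nova/nova.conf", "max_concurrent_builds"),
]

if __name__ == "__main__":
    for path, option in CHECKS:
        try:
            with open(path) as handle:
                value = read_option(handle.read(), option)
        except OSError as exc:
            print(f"{path}: cannot read ({exc})")
            continue
        print(f"{path}: {option} = {value}")
```

A value of None means the option is commented out or missing, so the service falls back to its built-in default.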

So my question is how do I tune my Undercloud to speed up the deployment?

Looking at htop during deployment I can see heat is using many CPUs, but
the work pattern is NOT distributed. What typically happens is that all
the CPUs are at 0 to 1% used apart from one, which is at 50 to 100%. The
id of this one busy CPU changes regularly, but there is no concurrent
distributed workload across all the CPUs that the heat processes are
running on. Is heat really multi-threaded, or does it have limitations so
that it can only really do proper work on one CPU at a time (which is
what I am seeing in htop)?
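The htop observation could be turned into numbers by sampling /proc/stat twice and comparing the per-CPU counters. The sketch below assumes the standard Linux /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal); the function names and the 5 second interval are illustrative, and it measures the whole machine rather than the heat processes alone.

```python
import time


def parse_proc_stat(text):
    """Map 'cpuN' to (total_ticks, idle_ticks) from /proc/stat content."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only per-CPU lines such as 'cpu0', not the aggregate 'cpu'.
        if not fields or not fields[0].startswith("cpu"):
            continue
        if not fields[0][3:].isdigit():
            continue
        values = [int(v) for v in fields[1:9]]
        # idle + iowait both count as time the CPU was not doing work.
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        counters[fields[0]] = (sum(values), idle)
    return counters


def busy_percent(before, after):
    """Per-CPU busy percentage between two /proc/stat snapshots."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    result = {}
    for name, (total2, idle2) in second.items():
        if name not in first:
            continue
        total1, idle1 = first[name]
        elapsed = total2 - total1
        busy = elapsed - (idle2 - idle1)
        result[name] = 100.0 * busy / elapsed if elapsed > 0 else 0.0
    return result


if __name__ == "__main__":
    with open("/proc/stat") as handle:
        snapshot_a = handle.read()
    time.sleep(5)  # sampling interval, chosen arbitrarily
    with open("/proc/stat") as handle:
        snapshot_b = handle.read()
    for cpu, pct in sorted(busy_percent(snapshot_a, snapshot_b).items()):
        print(f"{cpu}: {pct:.1f}% busy")
```

Run repeatedly during the configuration phase, this would show whether the load really does sit on a single CPU at any given moment.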

Thanks

Charles

-- 
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205