[rdo-list] [TripleO] Newton large baremetal deployment issues
Charles Short
cems at ebi.ac.uk
Wed Nov 2 09:44:05 UTC 2016
Hi,
I am running TripleO Newton stable release and am deploying on baremetal
with CentOS.
I have 64 nodes, and the Undercloud has plenty of resources, as it is one
of the nodes, with 294 GB of memory and 64 CPUs.
The provisioning network is 1 Gbps.
I have tried tuning the Undercloud using the tuning section (10.7) of
this guide:
https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/director-installation-and-usage/chapter-10-troubleshooting-director-issues
My Undercloud passes validations in Clapper
https://github.com/rthallisey/clapper
I am deploying with Network Isolation and 3 Controllers in HA.
If I create a stack with 3 Controllers and 3 compute nodes, this takes
about 1 hour.
If I create a stack with 3 Controllers and 15 compute nodes, this also
takes about 1 hour.
Both stacks pass Clapper validations.
During deployment I can see that the first 20 to 30 minutes use all
the bandwidth available for the overcloud image deployment, and then
hardly any bandwidth is used whilst the rest of the configuration takes place.
So I try a stack with 40 nodes. This is where I have issues.
I set the timeout to 4 hours and leave it overnight to deploy.
Every time, the deployment hits the timeout and fails.
During the 40 node deployment the overcloud image is distributed to all
nodes in about 45 minutes, and then all nodes appear ACTIVE and have an
IP address on the deployment network.
So it would appear that the rest of the low-bandwidth configuration is
taking well over 3 hours to complete. This seems excessive.
I have configured nova.conf for deployment concurrency (from the tuning
link above) and set 'num_engine_workers' in heat.conf to 32,
taking into account this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1370516
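
For reference, a minimal sketch of how the values actually present in the
two config files could be checked after editing them. Only
'num_engine_workers' is named above; the [DEFAULT] section, the file paths
/etc/heat/heat.conf and /etc/nova/nova.conf, and 'max_concurrent_builds' as
the nova concurrency option are assumptions, not taken from this message.

```python
import configparser
import os


def read_option(conf_text, section, option, default=None):
    """Return one option from INI-style config text, or `default` if absent."""
    # strict=False tolerates repeated keys; interpolation=None keeps any
    # literal '%' characters in values from raising errors.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.get(section, option, fallback=default)


if __name__ == "__main__":
    # Assumed file locations and option names (see the note above).
    checks = [
        ("/etc/heat/heat.conf", "DEFAULT", "num_engine_workers"),
        ("/etc/nova/nova.conf", "DEFAULT", "max_concurrent_builds"),
    ]
    for path, section, option in checks:
        if not os.path.exists(path):
            print(f"{path}: not found")
            continue
        with open(path) as handle:
            value = read_option(handle.read(), section, option, "not set")
        print(f"{path} [{section}] {option} = {value}")
```

This only reports what is written in the files; the services still need a
restart before a changed value takes effect.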
So my question is how do I tune my Undercloud to speed up the deployment?
Looking at htop during deployment I can see heat is using many CPUs, but
the work pattern is NOT distributed. What typically happens is all the
CPUs are at 0 to 1 % used apart from one which is at 50 to 100%. This
one CPU id changes regularly, but there is no concurrent distributed
workload across all the CPUs that the heat processes are running on. Is
heat really multi-threaded, or does it have limitations so that it can only
really do proper work on one CPU at a time (which is what I am seeing in htop)?
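
To put numbers on the htop observation, a rough sketch that samples
/proc/stat twice and reports how busy each CPU was in between. It assumes a
Linux host and does not attribute the load to heat or any other process;
the function names and the 50% threshold are illustrative only.

```python
import time


def parse_cpu_times(stat_text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-CPU line."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep "cpu0", "cpu1", ...; skip the aggregate "cpu" line and others.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Columns: user nice system idle iowait irq softirq steal guest guest_nice
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values[:8])  # guest time is already counted in user/nice
        times[fields[0]] = (total - idle, total)
    return times


def busy_percent(before, after):
    """Return {cpu_name: percent busy} between two parse_cpu_times() samples."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        delta_total = total1 - total0
        result[cpu] = 100.0 * (busy1 - busy0) / delta_total if delta_total > 0 else 0.0
    return result


if __name__ == "__main__":
    with open("/proc/stat") as handle:
        first = parse_cpu_times(handle.read())
    time.sleep(1)
    with open("/proc/stat") as handle:
        second = parse_cpu_times(handle.read())
    usage = busy_percent(first, second)
    hot = sorted(cpu for cpu, pct in usage.items() if pct >= 50.0)
    print(f"{len(hot)} of {len(usage)} CPUs at or above 50% busy: {hot}")
```

Run in a loop during a deployment, this would show whether more than one
CPU is ever busy at the same time.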
Thanks
Charles
--
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205