Some more testing of different numbers of nodes vs. time taken for
successful deployments -
3 controller 3 compute = 1 hour
3 controller 15 compute = 1 hour
3 controller 25 compute = 1 hour 45 mins
3 controller 35 compute = 4 hours
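The marginal cost per added compute node can be worked out from those timings (a quick sketch; the 4-hour figure is treated as a round 240 minutes):

```python
# Reported successful-deployment times (minutes) per compute-node count,
# taken from the figures above (3 controllers in every case).
times = {3: 60, 15: 60, 25: 105, 35: 240}

counts = sorted(times)
# Marginal minutes per additional compute node between consecutive runs.
marginal = {
    (a, b): (times[b] - times[a]) / (b - a)
    for a, b in zip(counts, counts[1:])
}
print(marginal)  # {(3, 15): 0.0, (15, 25): 4.5, (25, 35): 13.5}
```

The marginal cost per node triples between the 25-node and 35-node runs, which points at something super-linear in the configuration phase rather than a fixed per-node cost.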
Charles
On 02/11/2016 09:44, Charles Short wrote:
Hi,
I am running TripleO Newton stable release and am deploying on
baremetal with CentOS.
I have 64 nodes, and the Undercloud has plenty of resources, as it is
one of the nodes, with 294 GB of memory and 64 CPUs.
The provisioning network is 1 Gbps.
I have tried tuning the Undercloud using this tuning section in 10.7
as a guide
https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/p...
My Undercloud passes validations in Clapper
https://github.com/rthallisey/clapper
I am deploying with Network Isolation and 3 Controllers in HA.
If I create a stack with 3 Controllers and 3 compute nodes this takes
about 1 hour
If I create a stack with 3 Controllers and 15 compute nodes this takes
about 1 hour
Both stacks pass Clapper validations.
During deployment I can see that the first 20 to 30 mins use all the
bandwidth available for the overcloud image deployment, and then hardly
any bandwidth is used while the rest of the configuration takes place.
So I try a stack with 40 nodes. This is where I have issues.
I set the timeout to 4 hours and leave it overnight to deploy.
It times out and fails to deploy every time.
During the 40 node deployment the overcloud image is distributed to all
nodes in about 45 mins, and all nodes appear ACTIVE and have an IP
address on the deployment network.
So it would appear that the rest of the low bandwidth configuration is
taking well over 3 hours to complete. This seems excessive.
I have configured nova.conf for deployment concurrency (from the
tuning link above) and configured heat.conf 'num_engine_workers' to be
32, taking into account this bug
https://bugzilla.redhat.com/show_bug.cgi?id=1370516
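For concreteness, these settings live in the undercloud's heat.conf and nova.conf. A sketch of the stanzas involved (the values below are illustrative, not recommendations):

```ini
# /etc/heat/heat.conf (undercloud)
[DEFAULT]
# Number of heat-engine worker processes; 32 here per the bug referenced above.
num_engine_workers = 32

# /etc/nova/nova.conf (undercloud)
[DEFAULT]
# How many nodes will be provisioned in parallel -- the
# "deployment concurrency" knob from the tuning guide.
max_concurrent_builds = 10
```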
So my question is how do I tune my Undercloud to speed up the deployment?
Looking at htop during deployment I can see heat is using many CPUs,
but the work is NOT evenly distributed. What typically happens is that
all the CPUs sit at 0 to 1% usage apart from one, which is at 50 to
100%. The ID of that busy CPU changes regularly, but there is no
concurrent workload spread across all the CPUs that the heat processes
are running on. Is heat really multi-threaded, or does it have
limitations so that it can only really do proper work on one CPU at a
time (which is what I am seeing in htop)?
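For background: heat-engine is written in Python, and within a single worker process CPython's GIL prevents pure-Python threads from executing on more than one CPU at a time; parallelism comes from the separate worker processes (num_engine_workers), and a single stack operation tends to stay on one worker, which would match the one-busy-CPU pattern above. A minimal sketch of the GIL effect (illustrative only, nothing heat-specific):

```python
import threading
import time

def burn(n):
    # Pure-Python CPU-bound loop; under CPython's GIL, threads running
    # this cannot execute bytecode in parallel.
    x = 0
    for _ in range(n):
        x += 1
    return x

N = 5_000_000

# One run of the work, single-threaded.
start = time.perf_counter()
burn(N)
single = time.perf_counter() - start

# The same work in each of two threads: with the GIL they serialize,
# taking roughly twice as long instead of finishing in parallel.
start = time.perf_counter()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"single: {single:.2f}s, two threads: {threaded:.2f}s")
```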
Thanks
Charles
--
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205