[rdo-list] [TripleO] Newton large baremetal deployment issues
Charles Short
cems at ebi.ac.uk
Wed Nov 2 19:30:17 UTC 2016
Some more testing of node count versus time taken for successful
deployments:
3 controller 3 compute = 1 hour
3 controller 15 compute = 1 hour
3 controller 25 compute = 1 hour 45 mins
3 controller 35 compute = 4 hours
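
To make the scaling in these figures easier to see, here is a rough sketch that works out the extra deployment time per added compute node between consecutive runs. The numbers are copied from the list above; the function name and data layout are only illustrative assumptions, not part of any TripleO tooling.

```python
# Timings copied from the list above: (compute nodes, minutes to deploy).
# Every run used 3 controllers.
timings = [(3, 60), (15, 60), (25, 105), (35, 240)]


def marginal_minutes_per_node(points):
    """Return (from_nodes, to_nodes, extra minutes per added compute node)."""
    result = []
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        result.append((n0, n1, (t1 - t0) / (n1 - n0)))
    return result


for n0, n1, rate in marginal_minutes_per_node(timings):
    print(f"{n0:>2} -> {n1:>2} compute nodes: {rate:.1f} extra min per node")
```

On these four data points the marginal cost goes from 0 to 4.5 to 13.5 minutes per added node, so the deployment time grows faster than linearly past roughly 15 compute nodes.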
Charles
On 02/11/2016 09:44, Charles Short wrote:
> Hi,
>
> I am running TripleO Newton stable release and am deploying on
> baremetal with CentOS.
> I have 64 nodes, and the Undercloud has plenty of resources as it is
> one of the nodes, with 294 GB of memory and 64 CPUs.
> The provisioning network is 1 Gbps.
>
> I have tried tuning the Undercloud using the tuning section (10.7) of
> this document as a guide:
>
> https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/director-installation-and-usage/chapter-10-troubleshooting-director-issues
>
>
> My Undercloud passes validations in Clapper
>
> https://github.com/rthallisey/clapper
>
> I am deploying with Network Isolation and 3 Controllers in HA.
>
> If I create a stack with 3 Controllers and 3 compute nodes this takes
> about 1 hour
> If I create a stack with 3 Controllers and 15 compute nodes this takes
> about 1 hour
> Both stacks pass Clapper validations.
>
> During deployment I can see that the first 20 to 30 mins use all the
> bandwidth available for the overcloud image deployment, and then hardly
> any bandwidth is used whilst the rest of the configuration takes
> place.
>
> So I tried a stack with 40 nodes. This is where I have issues.
> I set the timeout to 4 hours and left it overnight to deploy.
> It hits the timeout and fails to deploy every time.
>
> During the 40 node deployment the overcloud image is distributed in
> about 45 mins to all nodes, and then all nodes appear ACTIVE and have
> an IP address on the deployment network.
> So it would appear that the rest of the low bandwidth configuration is
> taking well over 3 hours to complete. This seems excessive.
> I have configured nova.conf for deployment concurrency (from the
> tuning link above) and configured the heat.conf 'num_engine_workers'
> to be 32, taking into account this bug:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1370516
>
> So my question is how do I tune my Undercloud to speed up the deployment?
>
> Looking at htop during deployment I can see heat is using many CPUs,
> but the work pattern is NOT distributed. What typically happens is all
> the CPUs are at 0 to 1% used apart from one, which is at 50 to 100%.
> The busy CPU id changes regularly, but there is no concurrent
> distributed workload across all the CPUs that the heat processes are
> running on. Is heat really multi-threaded, or does it have limitations
> so it can only really do proper work on one CPU at a time (which is
> what I am seeing in htop)?
>
> Thanks
>
> Charles
>
>
>
--
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205