________________________________
From: Boris Derzhavets <bderzhavets(a)hotmail.com>
Sent: Thursday, June 30, 2016 2:17 PM
To: John Trowbridge; Dan Sneddon; rdo-list(a)redhat.com
Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again (
ControllerOvercloudServicesDeployment_Step4 )
________________________________
From: John Trowbridge <trown(a)redhat.com>
Sent: Thursday, June 30, 2016 1:47 PM
To: Boris Derzhavets; Dan Sneddon; rdo-list(a)redhat.com
Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again (
ControllerOvercloudServicesDeployment_Step4 )
On 06/30/2016 12:56 PM, Boris Derzhavets wrote:
________________________________
From: John Trowbridge <trown(a)redhat.com>
Sent: Thursday, June 30, 2016 10:14 AM
To: Boris Derzhavets; Dan Sneddon; rdo-list(a)redhat.com
Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again (
ControllerOvercloudServicesDeployment_Step4 )
On 06/30/2016 05:19 AM, Boris Derzhavets wrote:
>
>
>
> ________________________________
> From: rdo-list-bounces(a)redhat.com <rdo-list-bounces(a)redhat.com> on behalf of
Boris Derzhavets <bderzhavets(a)hotmail.com>
> Sent: Wednesday, June 29, 2016 5:14 PM
> To: Dan Sneddon; rdo-list(a)redhat.com
> Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again (
ControllerOvercloudServicesDeployment_Step4 )
>
> Yes, this was an attempt to deploy:
>
> ########################
> # HA +2xCompute
> ########################
> control_memory: 6144
> compute_memory: 6144
>
> undercloud_memory: 8192
>
> # Giving the undercloud additional CPUs can greatly improve heat's
> # performance (and result in a shorter deploy time).
> undercloud_vcpu: 4
Increasing this without also increasing the memory on the undercloud
will usually end in sadness, because more CPUs means more worker
processes means more memory consumption. In general, straying from the
values in CI is unlikely to work unless you have significantly better
hardware than what runs in CI (32G hosts with decent CPU).
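A rough back-of-the-envelope check of the memory budget may help here. It is only a sketch: it uses the values quoted in this thread (6144 MB per controller and per compute node, 8192 MB for the undercloud, a 32 GB virthost), ignores the host's own overhead and any overcommit or page sharing by the hypervisor, and the function name is made up for illustration.

```python
# Rough memory budget for the virtual environment discussed in this thread.
# All figures are in MB and come from the quoted config. Host overhead and
# hypervisor overcommit are deliberately ignored (an assumption).

def total_guest_memory(controllers, computes, control_memory=6144,
                       compute_memory=6144, undercloud_memory=8192):
    """Return the sum of the memory assigned to all guests, in MB."""
    return (controllers * control_memory
            + computes * compute_memory
            + undercloud_memory)


VIRTHOST_MB = 32 * 1024  # 32 GB virthost

for computes in (1, 2):
    assigned = total_guest_memory(controllers=3, computes=computes)
    print("3xController + %dxCompute: %d MB assigned, %d MB on the host"
          % (computes, assigned, VIRTHOST_MB))
```

With these numbers the 3xController + 1xCompute layout assigns exactly 32768 MB and the 3xController + 2xCompute layout assigns 38912 MB, so the second layout can only work as long as the guests do not actually touch all of their assigned memory.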
It will be verified tomorrow with
undercloud_vcpu: 2
The problem with introspection is gone.
The 3xController(HA) + 2xCompute config can be deployed with either
undercloud_vcpu: 2
or
undercloud_vcpu: 4
It doesn't matter which.
Thank you.
Boris.
That would be a fair test. It will take about 2 hours.
But I still believe that this is not the root cause of the issue with
the 3xController(HA) + 2xCompute configuration, which has
undercloud_memory: 8192
undercloud_vcpu: 4
and which was tested OK many times from 06/05 to 06/24
with no problems.
Just realized that you are also deploying 2x compute nodes. Just FYI,
even the basic HA setup barely fits on a 32G host. In fact on 3 of the 4
nodes in CI, we rarely get a pass of HA because the resources are so
tight. Will actually be switching that job to a single controller job
with pacemaker for exactly that reason (email to RDO list about that
will come later this afternoon).
How big is the virthost you are using?
32 GB. I am currently at the virthost console and
took a top snapshot of the running config,
3xController(HA) + 1xCompute.
The snapshot is attached. (Disregard the previous message; it was sent by accident.)
With the config
3xController(HA) + 2xCompute
I monitored with top many times
and didn't see big problems with RAM.
The current problem is most probably related to introspection of the nodes
that are supposed to form the overcloud.
Thank you very much for the feedback.
Boris.
https://github.com/openstack/tripleo-quickstart/blob/master/config/genera...
It is not 100% certain that this is the root cause of your issue, as the
logs below look like we hit issues either with the Ironic deployment to
the nodes or with the Nova scheduler. Note that this is definitely a
different problem (and possibly a transient one) than the one reported at
the beginning of this thread.
>
> # Create three controller nodes and two compute nodes.
> overcloud_nodes:
>   - name: control_0
>     flavor: control
>   - name: control_1
>     flavor: control
>   - name: control_2
>     flavor: control
>
>   - name: compute_0
>     flavor: compute
>   - name: compute_1
>     flavor: compute
>
> # We don't need introspection in a virtual environment (because we are
> # creating all the "hardware" we really know the necessary
> # information).
> introspect: false
>
> # Tell tripleo about our environment.
> network_isolation: true
> extra_args: >-
>   --control-scale 3 --compute-scale 2 --neutron-network-type vxlan
>   --neutron-tunnel-types vxlan
>   -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml
>   --ntp-server pool.ntp.org
> deploy_timeout: 75
> tempest: false
> pingtest: true
>
> Results during overcloud deployment :-
>
> 2016-06-30 09:09:31 [NovaCompute]: CREATE_FAILED ResourceInError:
resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found.
There are not enough hosts available., Code: 500"
> 2016-06-30 09:09:31 [NovaCompute]: DELETE_IN_PROGRESS state changed
> 2016-06-30 09:09:34 [NovaCompute]: DELETE_COMPLETE state changed
> 2016-06-30 09:09:44 [NovaCompute]: CREATE_IN_PROGRESS state changed
> 2016-06-30 09:09:48 [NovaCompute]: CREATE_FAILED ResourceInError:
resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found.
There are not enough hosts available., Code: 500"
> . . . . .
>
> 2016-06-30 09:11:36 [overcloud]: CREATE_FAILED Resource CREATE failed:
ResourceInError: resources.Compute.resources[0].resources.NovaCompute: Went to status
ERROR due to "Message: Build of instance bf483c34-7010-48ea-8f58-fe192c91093f
aborted: Failed to provision instance bf483c34-7010-48ea-8f58-fe192
> 2016-06-30 09:11:36 [1]: SIGNAL_COMPLETE Unknown
> 2016-06-30 09:11:36 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
> 2016-06-30 09:11:36 [1]: CREATE_COMPLETE state changed
> 2016-06-30 09:11:36 [overcloud-ControllerCephDeployment-62xh7uhtpjqp]:
CREATE_COMPLETE Stack CREATE completed successfully
> 2016-06-30 09:11:37 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
> 2016-06-30 09:11:37 [1]: SIGNAL_COMPLETE Unknown
> Stack overcloud CREATE_FAILED
> Deployment failed: Heat Stack create failed.
> + heat stack-list
> + grep -q CREATE_FAILED
> + deploy_status=1
> ++ heat resource-list --nested-depth 5 overcloud
> ++ grep FAILED
> ++ grep 'StructuredDeployment '
> ++ cut -d '|' -f3
> + exit 1
>
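The failure lines in a log like the one above can be pulled out with a small filter. This is a minimal sketch that assumes each event starts with a timestamp, a bracketed resource name, and a status word, as in the quoted output; wrapped continuation lines are simply skipped, and the function name is hypothetical.

```python
import re

# Matches event lines such as:
#   2016-06-30 09:09:31 [NovaCompute]: CREATE_FAILED ResourceInError: ...
EVENT_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<resource>[^\]]+)\]: (?P<status>[A-Z_]+)\s*(?P<reason>.*)$")


def failed_events(lines):
    """Yield (time, resource, reason) for every *_FAILED event line."""
    for line in lines:
        # Drop mail quoting ("> ") and surrounding whitespace first.
        match = EVENT_RE.match(line.lstrip("> ").strip())
        if match and match.group("status").endswith("_FAILED"):
            yield (match.group("time"), match.group("resource"),
                   match.group("reason"))


sample = [
    "> 2016-06-30 09:09:31 [NovaCompute]: CREATE_FAILED ResourceInError:",
    "> 2016-06-30 09:09:31 [NovaCompute]: DELETE_IN_PROGRESS state changed",
    "> 2016-06-30 09:11:36 [overcloud]: CREATE_FAILED Resource CREATE failed:",
]
for event in failed_events(sample):
    print(event)
```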
>
> Thanks.
>
> Boris
>
>
> ________________________________
> From: rdo-list-bounces(a)redhat.com <rdo-list-bounces(a)redhat.com> on behalf of
Dan Sneddon <dsneddon(a)redhat.com>
> Sent: Wednesday, June 29, 2016 1:46 PM
> To: rdo-list(a)redhat.com
> Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again (
ControllerOvercloudServicesDeployment_Step4 )
>
> On 06/29/2016 10:42 AM, Dan Sneddon wrote:
>> On 06/29/2016 07:03 AM, Boris Derzhavets wrote:
>>> Boris Derzhavets has shared a OneDrive file with you. To view it, click
>>> the link below.
>>>
>>> <https://1drv.ms/u/s!AqjiDzRpwaKogSHAekH8ZluOaclk>
>>>
>>> HeatCrash2.txt 1.gz <https://1drv.ms/u/s!AqjiDzRpwaKogSHAekH8ZluOaclk>
>>>
>>> Reattaching the gzip archive via OneDrive.
>>>
>>>
>>>
>>> -----------------------------------------------------------------------
>>> *From:* rdo-list-bounces(a)redhat.com <rdo-list-bounces(a)redhat.com> on
>>> behalf of Boris Derzhavets <bderzhavets(a)hotmail.com>
>>> *Sent:* Wednesday, June 29, 2016 9:36 AM
>>> *To:* John Trowbridge; shardy(a)redhat.com
>>> *Cc:* rdo-list(a)redhat.com
>>> *Subject:* [rdo-list] HA overcloud-deploy.sh crashes again (
>>> ControllerOvercloudServicesDeployment_Step4 )
>>>
>>>
>>> I attempted to follow the steps suggested in
>>> http://hardysteven.blogspot.ru/2016/06/tripleo-partial-stack-updates.html
>>>
>>>
>>> ./deploy-overstack crashes
>>>
>>>
>>> 2016-06-29 12:42:41
>>>
[overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk-ControllerOvercloudServicesDeployment_Step4-nzdoizlgrmx2]:
>>> CREATE_FAILED Resource CREATE failed: Error: resources[0]: Deployment
>>> to server failed: deploy_status_code : Deployment exited with non-zero
>>> status code: 6
>>> 2016-06-29 12:42:42 [ControllerOvercloudServicesDeployment_Step4]:
>>> CREATE_FAILED Error:
>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6
>>> 2016-06-29 12:42:43
>>> [overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk]: CREATE_FAILED
>>> Resource CREATE failed: Error:
>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6
>>> 2016-06-29 12:42:44 [ControllerNodesPostDeployment]: CREATE_FAILED
>>> Error:
>>>
resources.ControllerNodesPostDeployment.resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6
>>> 2016-06-29 12:42:44 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:45 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:45 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:46 [overcloud]: CREATE_FAILED Resource CREATE failed:
>>> Error:
>>>
resources.ControllerNodesPostDeployment.resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6
>>> 2016-06-29 12:42:46 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:47 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:47 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:48 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:48 [2]: SIGNAL_COMPLETE Unknown
>>> Stack overcloud CREATE_FAILED
>>> Deployment failed: Heat Stack create failed.
>>> + heat stack-list
>>> + grep -q CREATE_FAILED
>>> + deploy_status=1
>>> ++ heat resource-list --nested-depth 5 overcloud
>>> ++ grep FAILED
>>> ++ grep 'StructuredDeployment '
>>> ++ cut -d '|' -f3
>>> + for failed in '$(heat resource-list --nested-depth 5
>>> overcloud | grep FAILED |
>>> grep '\''StructuredDeployment '\'' | cut -d
'\''|'\'' -f3)'
>>> + heat deployment-show 655c77fc-6a78-4cca-b4b7-a153a3f4ad52
>>> + for failed in '$(heat resource-list --nested-depth 5
>>> overcloud | grep FAILED |
>>> grep '\''StructuredDeployment '\'' | cut -d
'\''|'\'' -f3)'
>>> + heat deployment-show 1fe5153c-e017-4ee5-823a-3d1524430c1d
>>> + for failed in '$(heat resource-list --nested-depth 5
>>> overcloud | grep FAILED |
>>> grep '\''StructuredDeployment '\'' | cut -d
'\''|'\'' -f3)'
>>> + heat deployment-show bf6f25f4-d812-41e9-a7a8-122de619a624
>>> + exit 1
>>>
>>> *****************************
>>> Troubleshooting steps :-
>>> *****************************
>>>
>>> [stack@undercloud ~]$ . stackrc
>>> [stack@undercloud ~]$ heat resource-list overcloud | grep
>>> ControllerNodesPost
>>> | ControllerNodesPostDeployment |
>>> f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3 |
>>> OS::TripleO::ControllerPostDeployment | CREATE_FAILED |
>>> 2016-06-29T12:11:21 |
>>>
>>>
>>> [stack@undercloud ~]$ heat stack-list -n | grep "^|
>>> f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3"
>>> | f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3 |
>>> overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk
>>> | CREATE_FAILED | 2016-06-29T12:31:11 | None |
>>> 17f82f6e-e0ca-44c6-9058-de82c00d4f79 |
>>>
>>>
>>>
>>> [stack@undercloud ~]$ heat event-list -m
>>> f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3
>>> overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk
>>>
>>>
+------------------------------------------------------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------------------+
>>> | resource_name |
>>> id |
>>> resource_status_reason
>>> | resource_status | event_time |
>>>
+------------------------------------------------------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------------------+
>>> | overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk |
>>> 10ec0cf9-b3c9-4191-9966-3f4d47f27e2a | Stack CREATE started
>>> . . . . . . . . . . . . . . . . .
>>> Step1,2,3 succeeded
>>> . . . . . . . . . . . . . . . . .
>>>
>>> | CREATE_IN_PROGRESS | 2016-06-29T12:31:14 |
>>> | ControllerPuppetConfig |
>>> a2a1df33-5106-425c-b16d-8d2df709b19f | state
>>> changed
>>> | CREATE_COMPLETE | 2016-06-29T12:35:02 |
>>> | ControllerOvercloudServicesDeployment_Step4 |
>>> 1e151333-4de5-4e7b-907c-ea0f42d31a47 | state
>>> changed
>>> | CREATE_IN_PROGRESS | 2016-06-29T12:35:03 |
>>> | ControllerOvercloudServicesDeployment_Step4 |
>>> 7bf36334-3d92-4554-b6c0-41294a072ab6 | Error:
>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6 | CREATE_FAILED |
>>> 2016-06-29T12:42:42 |
>>> | overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk
>>> | e72fb6f4-c2aa-4fe8-9bd1-5f5ad152685c | Resource CREATE failed:
>>> Error:
>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6 | CREATE_FAILED | 2016-06-29T12:42:43 |
>>>
+------------------------------------------------------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------------------+
>>>
>>> [stack@undercloud ~]$ heat stack-show
>>> overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk | grep
>>> NodeConfigIdentifiers
>>> | | "NodeConfigIdentifiers":
>>> "{u'deployment_identifier': 1467202276,
u'controller_config': {u'1':
>>> u'os-apply-config deployment 796df02a-7550-414b-a084-8b591a13e6db
>>> completed,Root CA cert injection not enabled.,TLS not enabled.,None,',
>>> u'0': u'os-apply-config deployment
613ec889-d852-470a-8e4c-6e243e1d2033
>>> completed,Root CA cert injection not enabled.,TLS not enabled.,None,',
>>> u'2': u'os-apply-config deployment
c8b099d0-3af4-4ba0-a056-a0ce60f40e2d
>>> completed,Root CA cert injection not enabled.,TLS not enabled.,None,'},
>>> u'allnodes_extra': u'none'}" |
>>>
>>> However, once stack creation has failed, an update doesn't help:
>>>
>>> [stack@undercloud ~]$ heat stack-update -x
>>> overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk -e update_env.yaml
>>> ERROR: PATCH update to non-COMPLETE stack is not supported.
>>>
>>> This is due to the state of the parent stack:
>>>
>>> [stack@undercloud ~]$ heat stack-list
>>>
+--------------------------------------+------------+---------------+---------------------+--------------+
>>> | id | stack_name | stack_status |
>>> creation_time | updated_time |
>>>
+--------------------------------------+------------+---------------+---------------------+--------------+
>>> | 17f82f6e-e0ca-44c6-9058-de82c00d4f79 | overcloud | CREATE_FAILED |
>>> 2016-06-29T12:11:20 | None |
>>>
+--------------------------------------+------------+---------------+---------------------+------
>>>
>>>
>>> The complete output of `heat deployment-show
>>> 655c77fc-6a78-4cca-b4b7-a153a3f4ad52` is attached as a gzip archive.
>>>
>>>
>>> Thanks.
>>>
>>> Boris.
>>>
>>>
>>>
>>> _______________________________________________
>>> rdo-list mailing list
>>> rdo-list(a)redhat.com
>>>
https://www.redhat.com/mailman/listinfo/rdo-list
>>>
>>> To unsubscribe: rdo-list-unsubscribe(a)redhat.com
>>>
>>
>> The failure occurred during the post-deployment, which means that the
>> initial deployment succeeded, but then the steps that are done to the
>> completed overcloud failed.
>>
>> This is most commonly attributable to network problems between the
>> Undercloud and the Overcloud Public API. The Undercloud needs to reach
>> the Public API in order to do some of the post-configuration steps. If
>> this API isn't reachable, you end up with the error you saw above.
>>
>> You can test this connectivity by pinging the Public API VIP from the
>> Undercloud. Starting with the failed deployment, run "neutron
>> port-list" against the Undercloud and look for the IP on the port
>> named "public_virtual_ip". You should be able to ping this address from
>> the Undercloud. If you can't reach that IP, then you need to check the
>> connectivity/routing between the Undercloud and the External network on
>> the Overcloud.
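The check described above could be scripted roughly as follows. This is a sketch under assumptions: it assumes "neutron port-list" prints a "|"-separated table whose fixed_ips column contains a JSON-like "ip_address" field, and the helper names and the sample row are made up for illustration.

```python
import re
import subprocess


def find_port_ip(port_list_output, port_name="public_virtual_ip"):
    """Return the first ip_address on the row naming port_name, or None."""
    for row in port_list_output.splitlines():
        cells = [cell.strip() for cell in row.split("|")]
        if port_name in cells:
            match = re.search(r'"ip_address":\s*"([^"]+)"', row)
            if match:
                return match.group(1)
    return None


def can_ping(ip, count=3):
    """Return True if `ping -c <count> <ip>` exits with status 0."""
    return subprocess.call(["ping", "-c", str(count), ip]) == 0


# Hypothetical row in the assumed table layout (IDs and address invented).
sample_output = (
    '| 1a2b | public_virtual_ip | fa:16:3e:00:00:01 '
    '| {"subnet_id": "3c4d", "ip_address": "192.0.2.10"} |'
)
print(find_port_ip(sample_output))
```

On the Undercloud, after sourcing stackrc, the real table would come from subprocess.check_output(["neutron", "port-list"]).decode(), and the address returned by find_port_ip would then be passed to can_ping.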
>>
>
> I should also mention common causes of this problem:
>
> * Incorrect value for ExternalInterfaceDefaultRoute in the network
> environment file.
> * Controllers do not have the default route on the External network in
> the NIC config templates (required for reachability from remote subnets).
> * Incorrect subnet mask on the ExternalNetCidr in the network environment.
> * Incorrect ExternalAllocationPools values in the network environment.
> * Incorrect Ethernet switch config for the Controllers.
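The address-related items in this list can be sanity-checked before deploying. The sketch below covers only the default route, the subnet mask, and the allocation pools; the parameter names are the ones mentioned above, while the function name and all addresses are hypothetical examples.

```python
import ipaddress


def check_external_network(cidr, default_route, allocation_pools):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    network = ipaddress.ip_network(cidr, strict=False)
    gateway = ipaddress.ip_address(default_route)
    if gateway not in network:
        problems.append("ExternalInterfaceDefaultRoute %s is outside "
                        "ExternalNetCidr %s" % (gateway, network))
    for pool in allocation_pools:
        start = ipaddress.ip_address(pool["start"])
        end = ipaddress.ip_address(pool["end"])
        if start not in network or end not in network:
            problems.append("ExternalAllocationPools %s-%s is outside "
                            "ExternalNetCidr %s" % (start, end, network))
        elif start > end:
            problems.append("ExternalAllocationPools %s-%s is reversed"
                            % (start, end))
    return problems


# Hypothetical values, in the shape used by a network environment file.
pools = [{"start": "10.0.0.10", "end": "10.0.0.50"}]
print(check_external_network("10.0.0.0/24", "10.0.0.1", pools))
print(check_external_network("10.0.0.0/24", "10.0.1.1", pools))
```

A wrong subnet mask shows up the same way: with a narrower ExternalNetCidr, a pool or gateway that was meant to be inside the network is reported as outside it.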
>
> The issue was reproduced 4 times, daily since 06/25/16, each time
> with exactly the same error at Step4
> of overcloud-ControllerNodesPostDeployment.
> In the meantime, I cannot reproduce the error.
> The 3xNode HA Controller + 1xCompute config works.
> There was one more issue: 3xNode HA Controller + 2xCompute
> failed immediately after overcloud-deploy.sh started, because
> only 4 nodes could be introspected. I will test it tomorrow morning.
>
> Thanks a lot.
> Boris.
>
> --
> Dan Sneddon | Principal OpenStack Engineer
> dsneddon(a)redhat.com | redhat.com/openstack
> 650.254.4025 | dsneddon:irc @dxs:twitter
>