[rdo-list] HA overcloud-deploy.sh crashes again ( ControllerOvercloudServicesDeployment_Step4 )

John Trowbridge trown at redhat.com
Thu Jun 30 17:47:02 UTC 2016



On 06/30/2016 12:56 PM, Boris Derzhavets wrote:
> 
> 
> 
> ________________________________
> From: John Trowbridge <trown at redhat.com>
> Sent: Thursday, June 30, 2016 10:14 AM
> To: Boris Derzhavets; Dan Sneddon; rdo-list at redhat.com
> Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again ( ControllerOvercloudServicesDeployment_Step4 )
> 
> 
> 
> On 06/30/2016 05:19 AM, Boris Derzhavets wrote:
>>
>>
>>
>> ________________________________
>> From: rdo-list-bounces at redhat.com <rdo-list-bounces at redhat.com> on behalf of Boris Derzhavets <bderzhavets at hotmail.com>
>> Sent: Wednesday, June 29, 2016 5:14 PM
>> To: Dan Sneddon; rdo-list at redhat.com
>> Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again ( ControllerOvercloudServicesDeployment_Step4 )
>>
>> Yes, this was an attempt to deploy:
>>
>> ########################
>> #  HA +2xCompute
>> ########################
>> control_memory: 6144
>> compute_memory: 6144
>>
>> undercloud_memory: 8192
>>
>> # Giving the undercloud additional CPUs can greatly improve heat's
>> # performance (and result in a shorter deploy time).
>> undercloud_vcpu: 4
> 
> Increasing this without also increasing the memory on the undercloud
> will usually end in sadness, because more CPUs means more worker
> processes, which means more memory consumption. In general, straying
> from the values in CI is unlikely to work unless you have significantly
> better hardware than what runs in CI (32G hosts with decent CPU).
> 
>    I will verify this tomorrow with
>    undercloud_vcpu: 2
>    That would be a fair test. It will take about 2 hr.
>    But I still believe that it is not the root cause of the issue with the
>    configuration 3xController(HA) + 2xCompute having:
>    undercloud_memory: 8192
>    undercloud_vcpu: 4
>    which was tested OK many times from 06/05 up to 06/24
>    with no problems.

I just realized that you are also deploying 2x compute nodes. Just FYI,
even the basic HA setup barely fits on a 32G host. In fact, on 3 of the 4
nodes in CI we rarely get a pass of HA because the resources are so
tight. We will actually be switching that job to a single-controller job
with pacemaker for exactly that reason (an email to the RDO list about
that will come later this afternoon).

How big is the virthost you are using?
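
For a rough sense of why I ask, here is a back-of-the-envelope sketch that
adds up the memory settings quoted in this thread. The function name
vm_memory_total and the 32G host size are assumptions for illustration
(32G is what CI uses; your virthost may differ), and the sum is only an
upper bound on what the VMs can claim, ignoring any host overhead or
memory overcommit.

```shell
#!/usr/bin/env bash
# Sum the RAM (in MiB) that the quickstart VMs are configured to use.
# Arguments: controller count, control_memory, compute count,
#            compute_memory, undercloud_memory.
vm_memory_total() {
    local control_count=$1 control_mib=$2
    local compute_count=$3 compute_mib=$4
    local undercloud_mib=$5
    echo $(( control_count * control_mib
             + compute_count * compute_mib
             + undercloud_mib ))
}

# Values from the config in this thread: 3 controllers and 2 computes
# at 6144 MiB each, plus an 8192 MiB undercloud.
total=$(vm_memory_total 3 6144 2 6144 8192)

# Assumption: a 32G virthost, as in CI.
host_mib=$(( 32 * 1024 ))

echo "VMs are configured for ${total} MiB; host has ${host_mib} MiB"
if [ "$total" -gt "$host_mib" ]; then
    echo "Configured VM memory exceeds host memory"
fi
```

With these numbers the VMs are configured for 38912 MiB against 32768 MiB
on the host, so the 2x compute layout would not fit without overcommit.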
> 
>   Thank you very much for feedback
>   Boris.
> 
> https://github.com/openstack/tripleo-quickstart/blob/master/config/general_config/ha.yml#L13
> 
> It is not 100% certain that this is the root cause of your issue, as the
> logs below look like we hit issues either with Ironic deployment to the
> nodes or with the Nova scheduler. Note that this is definitely a
> different problem (and possibly a transient one) than the one reported
> at the beginning of this thread.
> 
>>
>> # Create three controller nodes and two compute nodes.
>> overcloud_nodes:
>>   - name: control_0
>>     flavor: control
>>   - name: control_1
>>     flavor: control
>>   - name: control_2
>>     flavor: control
>>
>>   - name: compute_0
>>     flavor: compute
>>   - name: compute_1
>>     flavor: compute
>>
>> # We don't need introspection in a virtual environment (because we are
>> # creating all the "hardware" we really know the necessary
>> # information).
>> introspect: false
>>
>> # Tell tripleo about our environment.
>> network_isolation: true
>> extra_args: >-
>>   --control-scale 3 --compute-scale 2 --neutron-network-type vxlan
>>   --neutron-tunnel-types vxlan
>>   -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml
>>   --ntp-server pool.ntp.org
>> deploy_timeout: 75
>> tempest: false
>> pingtest: true
>>
>> Results during overcloud deployment :-
>>
>> 2016-06-30 09:09:31 [NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
>> 2016-06-30 09:09:31 [NovaCompute]: DELETE_IN_PROGRESS state changed
>> 2016-06-30 09:09:34 [NovaCompute]: DELETE_COMPLETE state changed
>> 2016-06-30 09:09:44 [NovaCompute]: CREATE_IN_PROGRESS state changed
>> 2016-06-30 09:09:48 [NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
>> . . . . .
>>
>> 2016-06-30 09:11:36 [overcloud]: CREATE_FAILED Resource CREATE failed: ResourceInError: resources.Compute.resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance bf483c34-7010-48ea-8f58-fe192c91093f aborted: Failed to provision instance bf483c34-7010-48ea-8f58-fe192
>> 2016-06-30 09:11:36 [1]: SIGNAL_COMPLETE Unknown
>> 2016-06-30 09:11:36 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
>> 2016-06-30 09:11:36 [1]: CREATE_COMPLETE state changed
>> 2016-06-30 09:11:36 [overcloud-ControllerCephDeployment-62xh7uhtpjqp]: CREATE_COMPLETE Stack CREATE completed successfully
>> 2016-06-30 09:11:37 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
>> 2016-06-30 09:11:37 [1]: SIGNAL_COMPLETE Unknown
>> Stack overcloud CREATE_FAILED
>> Deployment failed:  Heat Stack create failed.
>> + heat stack-list
>> + grep -q CREATE_FAILED
>> + deploy_status=1
>> ++ heat resource-list --nested-depth 5 overcloud
>> ++ grep FAILED
>> ++ grep 'StructuredDeployment '
>> ++ cut -d '|' -f3
>> + exit 1
>>
>>
>> Thanks.
>>
>> Boris
>>
>>
>> ________________________________
>> From: rdo-list-bounces at redhat.com <rdo-list-bounces at redhat.com> on behalf of Dan Sneddon <dsneddon at redhat.com>
>> Sent: Wednesday, June 29, 2016 1:46 PM
>> To: rdo-list at redhat.com
>> Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again ( ControllerOvercloudServicesDeployment_Step4 )
>>
>> On 06/29/2016 10:42 AM, Dan Sneddon wrote:
>>> On 06/29/2016 07:03 AM, Boris Derzhavets wrote:
>>>> Boris Derzhavets has shared a OneDrive file with you. To view it, click
>>>> the link below.
>>>>
>>>> <https://1drv.ms/u/s!AqjiDzRpwaKogSHAekH8ZluOaclk>
>>>>
>>>> HeatCrash2.txt 1.gz <https://1drv.ms/u/s!AqjiDzRpwaKogSHAekH8ZluOaclk>
>>>>
>>>> Reattach gzip archive via One Drive
>>>>
>>>>
>>>>
>>>> -----------------------------------------------------------------------
>>>> *From:* rdo-list-bounces at redhat.com <rdo-list-bounces at redhat.com> on
>>>> behalf of Boris Derzhavets <bderzhavets at hotmail.com>
>>>> *Sent:* Wednesday, June 29, 2016 9:36 AM
>>>> *To:* John Trowbridge; shardy at redhat.com
>>>> *Cc:* rdo-list at redhat.com
>>>> *Subject:* [rdo-list] HA overcloud-deploy.sh crashes again (
>>>> ControllerOvercloudServicesDeployment_Step4 )
>>>>
>>>>
>>>> Attempt to follow steps suggested
>>>> in http://hardysteven.blogspot.ru/2016/06/tripleo-partial-stack-updates.html
>>>>
>>>>
>>>> ./deploy-overstack crashes
>>>>
>>>>
>>>> 2016-06-29 12:42:41
>>>> [overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk-ControllerOvercloudServicesDeployment_Step4-nzdoizlgrmx2]:
>>>> CREATE_FAILED Resource CREATE failed: Error: resources[0]: Deployment
>>>> to server failed: deploy_status_code : Deployment exited with non-zero
>>>> status code: 6
>>>> 2016-06-29 12:42:42 [ControllerOvercloudServicesDeployment_Step4]:
>>>> CREATE_FAILED Error:
>>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>>> non-zero status code: 6
>>>> 2016-06-29 12:42:43
>>>> [overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk]: CREATE_FAILED
>>>> Resource CREATE failed: Error:
>>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>>> non-zero status code: 6
>>>> 2016-06-29 12:42:44 [ControllerNodesPostDeployment]: CREATE_FAILED
>>>> Error:
>>>> resources.ControllerNodesPostDeployment.resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>>> non-zero status code: 6
>>>> 2016-06-29 12:42:44 [2]: SIGNAL_COMPLETE Unknown
>>>> 2016-06-29 12:42:45 [2]: SIGNAL_COMPLETE Unknown
>>>> 2016-06-29 12:42:45 [2]: SIGNAL_COMPLETE Unknown
>>>> 2016-06-29 12:42:46 [overcloud]: CREATE_FAILED Resource CREATE failed:
>>>> Error:
>>>> resources.ControllerNodesPostDeployment.resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>>> non-zero status code: 6
>>>> 2016-06-29 12:42:46 [2]: SIGNAL_COMPLETE Unknown
>>>> 2016-06-29 12:42:47 [2]: SIGNAL_COMPLETE Unknown
>>>> 2016-06-29 12:42:47 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
>>>> 2016-06-29 12:42:48 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
>>>> 2016-06-29 12:42:48 [2]: SIGNAL_COMPLETE Unknown
>>>> Stack overcloud CREATE_FAILED
>>>> Deployment failed:  Heat Stack create failed.
>>>> + heat stack-list
>>>> + grep -q CREATE_FAILED
>>>> + deploy_status=1
>>>> ++ heat resource-list --nested-depth 5 overcloud
>>>> ++ grep FAILED
>>>> ++ grep 'StructuredDeployment '
>>>> ++ cut -d '|' -f3
>>>> + for failed in '$(heat resource-list         --nested-depth 5
>>>> overcloud | grep FAILED |
>>>>         grep '\''StructuredDeployment '\'' | cut -d '\''|'\'' -f3)'
>>>> + heat deployment-show 655c77fc-6a78-4cca-b4b7-a153a3f4ad52
>>>> + for failed in '$(heat resource-list         --nested-depth 5
>>>> overcloud | grep FAILED |
>>>>         grep '\''StructuredDeployment '\'' | cut -d '\''|'\'' -f3)'
>>>> + heat deployment-show 1fe5153c-e017-4ee5-823a-3d1524430c1d
>>>> + for failed in '$(heat resource-list         --nested-depth 5
>>>> overcloud | grep FAILED |
>>>>         grep '\''StructuredDeployment '\'' | cut -d '\''|'\'' -f3)'
>>>> + heat deployment-show bf6f25f4-d812-41e9-a7a8-122de619a624
>>>> + exit 1
>>>>
>>>> *****************************
>>>> Troubleshooting steps :-
>>>> *****************************
>>>>
>>>> [stack@undercloud ~]$ . stackrc
>>>> [stack@undercloud ~]$  heat resource-list overcloud | grep
>>>> ControllerNodesPost
>>>> | ControllerNodesPostDeployment             |
>>>> f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3          |
>>>> OS::TripleO::ControllerPostDeployment             | CREATE_FAILED   |
>>>> 2016-06-29T12:11:21 |
>>>>
>>>>
>>>> [stack@undercloud ~]$ heat stack-list -n | grep "^|
>>>> f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3"
>>>> | f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3 |
>>>> overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk
>>>> | CREATE_FAILED   | 2016-06-29T12:31:11 | None         |
>>>> 17f82f6e-e0ca-44c6-9058-de82c00d4f79 |
>>>>
>>>>
>>>>
>>>> [stack@undercloud ~]$ heat event-list -m
>>>> f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3
>>>> overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk
>>>>
>>>> +------------------------------------------------------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------------------+
>>>> | resource_name                                        |
>>>> id                                   |
>>>> resource_status_reason
>>>> | resource_status    | event_time          |
>>>> +------------------------------------------------------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------------------+
>>>> | overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk |
>>>> 10ec0cf9-b3c9-4191-9966-3f4d47f27e2a | Stack CREATE started
>>>> . . . . . . . . . . . . . . . . .
>>>> Step1,2,3 succeeded
>>>> . . . . . . . . . . . . . . . . .
>>>>
>>>> | CREATE_IN_PROGRESS | 2016-06-29T12:31:14 |
>>>> | ControllerPuppetConfig                               |
>>>> a2a1df33-5106-425c-b16d-8d2df709b19f | state
>>>> changed
>>>> | CREATE_COMPLETE    | 2016-06-29T12:35:02 |
>>>> | ControllerOvercloudServicesDeployment_Step4          |
>>>> 1e151333-4de5-4e7b-907c-ea0f42d31a47 | state
>>>> changed
>>>> | CREATE_IN_PROGRESS | 2016-06-29T12:35:03 |
>>>> | ControllerOvercloudServicesDeployment_Step4          |
>>>> 7bf36334-3d92-4554-b6c0-41294a072ab6 | Error:
>>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>>> non-zero status code: 6                         | CREATE_FAILED      |
>>>> 2016-06-29T12:42:42 |
>>>> | overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk
>>>>  | e72fb6f4-c2aa-4fe8-9bd1-5f5ad152685c | Resource CREATE failed:
>>>> Error:
>>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>>> non-zero status code: 6 | CREATE_FAILED      | 2016-06-29T12:42:43 |
>>>> +------------------------------------------------------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------------------+
>>>>
>>>> [stack@undercloud ~]$ heat stack-show
>>>> overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk | grep
>>>> NodeConfigIdentifiers
>>>> |                       |   "NodeConfigIdentifiers":
>>>> "{u'deployment_identifier': 1467202276, u'controller_config': {u'1':
>>>> u'os-apply-config deployment 796df02a-7550-414b-a084-8b591a13e6db
>>>> completed,Root CA cert injection not enabled.,TLS not enabled.,None,',
>>>> u'0': u'os-apply-config deployment 613ec889-d852-470a-8e4c-6e243e1d2033
>>>> completed,Root CA cert injection not enabled.,TLS not enabled.,None,',
>>>> u'2': u'os-apply-config deployment c8b099d0-3af4-4ba0-a056-a0ce60f40e2d
>>>> completed,Root CA cert injection not enabled.,TLS not enabled.,None,'},
>>>> u'allnodes_extra': u'none'}" |
>>>>
>>>> However, once stack creation has failed, an update does not help.
>>>>
>>>> [stack@undercloud ~]$ heat stack-update -x
>>>> overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk   -e update_env.yaml
>>>> ERROR: PATCH update to non-COMPLETE stack is not supported.
>>>>
>>>> DUE TO :-
>>>>
>>>> [stack@undercloud ~]$ heat stack-list
>>>> +--------------------------------------+------------+---------------+---------------------+--------------+
>>>> | id                                   | stack_name | stack_status  |
>>>> creation_time       | updated_time |
>>>> +--------------------------------------+------------+---------------+---------------------+--------------+
>>>> | 17f82f6e-e0ca-44c6-9058-de82c00d4f79 | overcloud  | CREATE_FAILED |
>>>> 2016-06-29T12:11:20 | None         |
>>>> +--------------------------------------+------------+---------------+---------------------+------
>>>>
>>>>
>>>> The complete output of `heat deployment-show
>>>> 655c77fc-6a78-4cca-b4b7-a153a3f4ad52` is attached as a gzip archive.
>>>>
>>>>
>>>> Thanks.
>>>>
>>>> Boris.
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> rdo-list mailing list
>>>> rdo-list at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/rdo-list
>>>>
>>>> To unsubscribe: rdo-list-unsubscribe at redhat.com
>>>>
>>>
>>> The failure occurred during the post-deployment, which means that the
>>> initial deployment succeeded, but then the steps that are done to the
>>> completed overcloud failed.
>>>
>>> This is most commonly attributable to network problems between the
>>> Undercloud and the Overcloud Public API. The Undercloud needs to reach
>>> the Public API in order to do some of the post-configuration steps. If
>>> this API isn't reachable, you end up with the error you saw above.
>>>
>>> You can test this connectivity by pinging the Public API VIP from the
>>> Undercloud. Starting with the failed deployment, run "neutron
>>> port-list" against the Undercloud and look for the IP on the port
>>> named "public_virtual_ip". You should be able to ping this address from
>>> the Undercloud. If you can't reach that IP, then you need to check the
>>> connectivity/routing between the Undercloud and the External network on
>>> the Overcloud.
>>>
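
The connectivity check described above could be scripted roughly as
follows. This is only a sketch: it assumes the neutron client prints each
port as a table row whose fixed_ips column contains
"ip_address": "<addr>", and the helper name extract_public_vip is made up
for illustration.

```shell
#!/usr/bin/env bash
# Read `neutron port-list` output on stdin and print the IP address of
# the port named public_virtual_ip (first match only).
# Assumes the fixed_ips column contains "ip_address": "<addr>".
extract_public_vip() {
    grep 'public_virtual_ip' \
        | grep -oE '"ip_address": *"[^"]+"' \
        | cut -d '"' -f 4 \
        | head -n 1
}

# Intended use on the Undercloud (not run here):
#   . stackrc
#   vip=$(neutron port-list | extract_public_vip)
#   ping -c 3 "$vip"
```

If the ping fails, the routing between the Undercloud and the External
network on the Overcloud is the first thing to check, as described above.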
>>
>> I should also mention common causes of this problem:
>>
>> * Incorrect value for ExternalInterfaceDefaultRoute in the network
>> environment file.
>> * Controllers do not have the default route on the External network in
>> the NIC config templates (required for reachability from remote subnets).
>> * Incorrect subnet mask on the ExternalNetCidr in the network environment.
>> * Incorrect ExternalAllocationPools values in the network environment.
>> * Incorrect Ethernet switch config for the Controllers.
>>
>>         The issue was reproduced 4 times, daily since 06/25/16, each
>>         time with exactly the same error at Step4 of
>>         overcloud-ControllerNodesPostDeployment.
>>         For now, I cannot reproduce the error.
>>         The 3xNode HA Controller + 1xCompute config works.
>>         There was one more issue: 3xNode HA Controller + 2xCompute
>>         failed immediately when overcloud-deploy.sh started because
>>         only 4 nodes could be introspected. I will test it tomorrow morning.
>>
>>         Thanks a lot.
>>         Boris.
>>
>> --
>> Dan Sneddon         |  Principal OpenStack Engineer
>> dsneddon at redhat.com |  redhat.com/openstack
>> 650.254.4025        |  dsneddon:irc   @dxs:twitter
>>
>>
>>
>>
>>



