[rdo-list] HA overcloud-deploy.sh crashes again ( ControllerOvercloudServicesDeployment_Step4 )

Boris Derzhavets bderzhavets at hotmail.com
Thu Jun 30 16:56:44 UTC 2016




________________________________
From: John Trowbridge <trown at redhat.com>
Sent: Thursday, June 30, 2016 10:14 AM
To: Boris Derzhavets; Dan Sneddon; rdo-list at redhat.com
Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again ( ControllerOvercloudServicesDeployment_Step4 )



On 06/30/2016 05:19 AM, Boris Derzhavets wrote:
>
>
>
> ________________________________
> From: rdo-list-bounces at redhat.com <rdo-list-bounces at redhat.com> on behalf of Boris Derzhavets <bderzhavets at hotmail.com>
> Sent: Wednesday, June 29, 2016 5:14 PM
> To: Dan Sneddon; rdo-list at redhat.com
> Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again ( ControllerOvercloudServicesDeployment_Step4 )
>
> Yes, this was an attempt to deploy:
>
> ########################
> #  HA +2xCompute
> ########################
> control_memory: 6144
> compute_memory: 6144
>
> undercloud_memory: 8192
>
> # Giving the undercloud additional CPUs can greatly improve heat's
> # performance (and result in a shorter deploy time).
> undercloud_vcpu: 4

Increasing this without also increasing the memory on the undercloud
will usually end in sadness, because more CPUs means more worker
processes, which means more memory consumption. In general, straying from
the values in CI is unlikely to work unless you have significantly better
hardware than what runs in CI (32G hosts with decent CPU).
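
To see whether worker processes really are what eats the undercloud's memory, something like the following might help. It is only a sketch: `rss_by_command` is a helper name made up here, and it assumes a Linux `ps` that accepts `-eo rss=,comm=`.

```shell
# Hypothetical helper: sum resident memory (RSS, KiB) per command name.
# Reads "rss command" pairs on stdin and prints "total_kib count command",
# largest total first. Only the first word of the command name is used.
rss_by_command() {
    awk '{ sum[$2] += $1; count[$2]++ }
         END { for (c in sum) printf "%d %d %s\n", sum[c], count[c], c }' |
        sort -rn
}

# On the undercloud (assumption: procps ps is available):
#   ps -eo rss=,comm= | rss_by_command | head
```

Comparing the output for undercloud_vcpu: 2 and undercloud_vcpu: 4 should show whether the number of workers, and their total memory, grows with the CPU count.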

   It will be verified tomorrow with
   undercloud_vcpu: 2
   That would be a fair test. It will take about 2 hours.
   But I still believe that this is not the root cause of the issue with
   the configuration 3xController (HA) + 2xCompute having:
   undercloud_memory: 8192
   undercloud_vcpu: 4
   which was tested OK many times from 06/05 to 06/24
   with no problems.

  Thank you very much for the feedback.
  Boris.

https://github.com/openstack/tripleo-quickstart/blob/master/config/general_config/ha.yml#L13

It is not 100% certain that this is the root cause of your issue, as the
logs below look like we hit issues either with the Ironic deployment to the
nodes or with the Nova scheduler. Note that this is definitely a different
problem (and possibly a transient one) from the one reported at the
beginning of this thread.
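
For a quick look at which resources failed in a deploy log like the one quoted below, a small filter along these lines may be enough. This is a sketch: `failed_events` is a made-up name, and it assumes each event sits on a single line in the form `DATE TIME [resource]: STATUS reason`, as in that log.

```shell
# Hypothetical helper: print "date time resource" for every CREATE_FAILED
# event read from stdin.
failed_events() {
    awk '$4 == "CREATE_FAILED" {
        res = $3
        sub(/^\[/, "", res)   # drop the leading "["
        sub(/\]:$/, "", res)  # drop the trailing "]:"
        print $1, $2, res
    }'
}

# Usage (deploy.log is a hypothetical file name):
#   failed_events < deploy.log
```

For "No valid host was found", the Ironic node states and the hypervisor totals seen by Nova are the usual places to look next, e.g. `ironic node-list` and `nova hypervisor-stats`, assuming those clients are installed on the undercloud.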

>
> # Create three controller nodes and two compute nodes.
> overcloud_nodes:
>   - name: control_0
>     flavor: control
>   - name: control_1
>     flavor: control
>   - name: control_2
>     flavor: control
>
>   - name: compute_0
>     flavor: compute
>   - name: compute_1
>     flavor: compute
>
> # We don't need introspection in a virtual environment (because we are
> # creating all the "hardware" we really know the necessary
> # information).
> introspect: false
>
> # Tell tripleo about our environment.
> network_isolation: true
> extra_args: >-
>   --control-scale 3 --compute-scale 2 --neutron-network-type vxlan
>   --neutron-tunnel-types vxlan
>   -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml
>   --ntp-server pool.ntp.org
> deploy_timeout: 75
> tempest: false
> pingtest: true
>
> Results during overcloud deployment :-
>
> 2016-06-30 09:09:31 [NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
> 2016-06-30 09:09:31 [NovaCompute]: DELETE_IN_PROGRESS state changed
> 2016-06-30 09:09:34 [NovaCompute]: DELETE_COMPLETE state changed
> 2016-06-30 09:09:44 [NovaCompute]: CREATE_IN_PROGRESS state changed
> 2016-06-30 09:09:48 [NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
> . . . . .
>
> 2016-06-30 09:11:36 [overcloud]: CREATE_FAILED Resource CREATE failed: ResourceInError: resources.Compute.resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance bf483c34-7010-48ea-8f58-fe192c91093f aborted: Failed to provision instance bf483c34-7010-48ea-8f58-fe192
> 2016-06-30 09:11:36 [1]: SIGNAL_COMPLETE Unknown
> 2016-06-30 09:11:36 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
> 2016-06-30 09:11:36 [1]: CREATE_COMPLETE state changed
> 2016-06-30 09:11:36 [overcloud-ControllerCephDeployment-62xh7uhtpjqp]: CREATE_COMPLETE Stack CREATE completed successfully
> 2016-06-30 09:11:37 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
> 2016-06-30 09:11:37 [1]: SIGNAL_COMPLETE Unknown
> Stack overcloud CREATE_FAILED
> Deployment failed:  Heat Stack create failed.
> + heat stack-list
> + grep -q CREATE_FAILED
> + deploy_status=1
> ++ heat resource-list --nested-depth 5 overcloud
> ++ grep FAILED
> ++ grep 'StructuredDeployment '
> ++ cut -d '|' -f3
> + exit 1
>
>
> Thanks.
>
> Boris
>
>
> ________________________________
> From: rdo-list-bounces at redhat.com <rdo-list-bounces at redhat.com> on behalf of Dan Sneddon <dsneddon at redhat.com>
> Sent: Wednesday, June 29, 2016 1:46 PM
> To: rdo-list at redhat.com
> Subject: Re: [rdo-list] HA overcloud-deploy.sh crashes again ( ControllerOvercloudServicesDeployment_Step4 )
>
> On 06/29/2016 10:42 AM, Dan Sneddon wrote:
>> On 06/29/2016 07:03 AM, Boris Derzhavets wrote:
>>> Boris Derzhavets has shared a OneDrive file with you. To view it, click
>>> the link below.
>>>
>>> <https://1drv.ms/u/s!AqjiDzRpwaKogSHAekH8ZluOaclk>
>>>
>>> HeatCrash2.txt 1.gz <https://1drv.ms/u/s!AqjiDzRpwaKogSHAekH8ZluOaclk>
>>>       [HeatCrash2.txt 1.gz]
>>>
>>> Reattaching the gzip archive via OneDrive.
>>>
>>>
>>>
>>> -----------------------------------------------------------------------
>>> *From:* rdo-list-bounces at redhat.com <rdo-list-bounces at redhat.com> on
>>> behalf of Boris Derzhavets <bderzhavets at hotmail.com>
>>> *Sent:* Wednesday, June 29, 2016 9:36 AM
>>> *To:* John Trowbridge; shardy at redhat.com
>>> *Cc:* rdo-list at redhat.com
>>> *Subject:* [rdo-list] HA overcloud-deploy.sh crashes again (
>>> ControllerOvercloudServicesDeployment_Step4 )
>>>
>>>
>>> Attempt to follow steps suggested
>>> in http://hardysteven.blogspot.ru/2016/06/tripleo-partial-stack-updates.html
>>>
>>>
>>> ./deploy-overstack crashes
>>>
>>>
>>> 2016-06-29 12:42:41
>>> [overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk-ControllerOvercloudServicesDeployment_Step4-nzdoizlgrmx2]:
>>> CREATE_FAILED Resource CREATE failed: Error: resources[0]: Deployment
>>> to server failed: deploy_status_code : Deployment exited with non-zero
>>> status code: 6
>>> 2016-06-29 12:42:42 [ControllerOvercloudServicesDeployment_Step4]:
>>> CREATE_FAILED Error:
>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6
>>> 2016-06-29 12:42:43
>>> [overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk]: CREATE_FAILED
>>> Resource CREATE failed: Error:
>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6
>>> 2016-06-29 12:42:44 [ControllerNodesPostDeployment]: CREATE_FAILED
>>> Error:
>>> resources.ControllerNodesPostDeployment.resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6
>>> 2016-06-29 12:42:44 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:45 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:45 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:46 [overcloud]: CREATE_FAILED Resource CREATE failed:
>>> Error:
>>> resources.ControllerNodesPostDeployment.resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6
>>> 2016-06-29 12:42:46 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:47 [2]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:47 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:48 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
>>> 2016-06-29 12:42:48 [2]: SIGNAL_COMPLETE Unknown
>>> Stack overcloud CREATE_FAILED
>>> Deployment failed:  Heat Stack create failed.
>>> + heat stack-list
>>> + grep -q CREATE_FAILED
>>> + deploy_status=1
>>> ++ heat resource-list --nested-depth 5 overcloud
>>> ++ grep FAILED
>>> ++ grep 'StructuredDeployment '
>>> ++ cut -d '|' -f3
>>> + for failed in '$(heat resource-list         --nested-depth 5
>>> overcloud | grep FAILED |
>>>         grep '\''StructuredDeployment '\'' | cut -d '\''|'\'' -f3)'
>>> + heat deployment-show 655c77fc-6a78-4cca-b4b7-a153a3f4ad52
>>> + for failed in '$(heat resource-list         --nested-depth 5
>>> overcloud | grep FAILED |
>>>         grep '\''StructuredDeployment '\'' | cut -d '\''|'\'' -f3)'
>>> + heat deployment-show 1fe5153c-e017-4ee5-823a-3d1524430c1d
>>> + for failed in '$(heat resource-list         --nested-depth 5
>>> overcloud | grep FAILED |
>>>         grep '\''StructuredDeployment '\'' | cut -d '\''|'\'' -f3)'
>>> + heat deployment-show bf6f25f4-d812-41e9-a7a8-122de619a624
>>> + exit 1
>>>
>>> *****************************
>>> Troubleshooting steps :-
>>> *****************************
>>>
>>> [stack at undercloud ~]$ . stackrc
>>> [stack at undercloud ~]$ heat resource-list overcloud | grep ControllerNodesPost
>>> | ControllerNodesPostDeployment | f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3 | OS::TripleO::ControllerPostDeployment | CREATE_FAILED | 2016-06-29T12:11:21 |
>>>
>>>
>>> [stack at undercloud ~]$ heat stack-list -n | grep "^| f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3"
>>> | f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3 | overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk | CREATE_FAILED | 2016-06-29T12:31:11 | None | 17f82f6e-e0ca-44c6-9058-de82c00d4f79 |
>>>
>>>
>>>
>>> [stack at undercloud ~]$ heat event-list -m f1d6a474-c946-46bf-ab0c-2fdaeb55d0b3 overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk
>>>
>>> +------------------------------------------------------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------------------+
>>> | resource_name                                        |
>>> id                                   |
>>> resource_status_reason
>>> | resource_status    | event_time          |
>>> +------------------------------------------------------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------------------+
>>> | overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk |
>>> 10ec0cf9-b3c9-4191-9966-3f4d47f27e2a | Stack CREATE started
>>> . . . . . . . . . . . . . . . . .
>>> Step1,2,3 succeeded
>>> . . . . . . . . . . . . . . . . .
>>>
>>> | CREATE_IN_PROGRESS | 2016-06-29T12:31:14 |
>>> | ControllerPuppetConfig                               |
>>> a2a1df33-5106-425c-b16d-8d2df709b19f | state
>>> changed
>>> | CREATE_COMPLETE    | 2016-06-29T12:35:02 |
>>> | ControllerOvercloudServicesDeployment_Step4          |
>>> 1e151333-4de5-4e7b-907c-ea0f42d31a47 | state
>>> changed
>>> | CREATE_IN_PROGRESS | 2016-06-29T12:35:03 |
>>> | ControllerOvercloudServicesDeployment_Step4          |
>>> 7bf36334-3d92-4554-b6c0-41294a072ab6 | Error:
>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6                         | CREATE_FAILED      |
>>> 2016-06-29T12:42:42 |
>>> | overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk
>>>  | e72fb6f4-c2aa-4fe8-9bd1-5f5ad152685c | Resource CREATE failed:
>>> Error:
>>> resources.ControllerOvercloudServicesDeployment_Step4.resources[0]:
>>> Deployment to server failed: deploy_status_code: Deployment exited with
>>> non-zero status code: 6 | CREATE_FAILED      | 2016-06-29T12:42:43 |
>>> +------------------------------------------------------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------------------+
>>>
>>> [stack at undercloud ~]$ heat stack-show overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk | grep NodeConfigIdentifiers
>>> |                       |   "NodeConfigIdentifiers":
>>> "{u'deployment_identifier': 1467202276, u'controller_config': {u'1':
>>> u'os-apply-config deployment 796df02a-7550-414b-a084-8b591a13e6db
>>> completed,Root CA cert injection not enabled.,TLS not enabled.,None,',
>>> u'0': u'os-apply-config deployment 613ec889-d852-470a-8e4c-6e243e1d2033
>>> completed,Root CA cert injection not enabled.,TLS not enabled.,None,',
>>> u'2': u'os-apply-config deployment c8b099d0-3af4-4ba0-a056-a0ce60f40e2d
>>> completed,Root CA cert injection not enabled.,TLS not enabled.,None,'},
>>> u'allnodes_extra': u'none'}" |
>>>
>>> However, once stack creation has failed, an update does not help.
>>>
>>> [stack at undercloud ~]$ heat stack-update -x
>>> overcloud-ControllerNodesPostDeployment-2r4tlv5icaxk   -e update_env.yaml
>>> ERROR: PATCH update to non-COMPLETE stack is not supported.
>>>
>>> DUE TO :-
>>>
>>> [stack at undercloud ~]$ heat stack-list
>>> +--------------------------------------+------------+---------------+---------------------+--------------+
>>> | id                                   | stack_name | stack_status  | creation_time       | updated_time |
>>> +--------------------------------------+------------+---------------+---------------------+--------------+
>>> | 17f82f6e-e0ca-44c6-9058-de82c00d4f79 | overcloud  | CREATE_FAILED | 2016-06-29T12:11:20 | None         |
>>> +--------------------------------------+------------+---------------+---------------------+--------------+
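
Since the PATCH update is refused for a non-COMPLETE stack, it may be worth checking the status before trying it. A minimal sketch, assuming the `heat stack-list` column layout shown above (`stack_status_of` is a made-up helper name):

```shell
# Hypothetical helper: read "heat stack-list" table output on stdin and
# print the stack_status column for the stack whose name is given as $1.
stack_status_of() {
    awk -F'|' -v name="$1" '
        { n = $3; gsub(/^ +| +$/, "", n) }          # trimmed stack_name
        n == name { s = $4; gsub(/^ +| +$/, "", s); print s }'
}

# Usage on the undercloud:
#   case "$(heat stack-list | stack_status_of overcloud)" in
#       *_COMPLETE) echo "stack is COMPLETE, a PATCH update can be tried" ;;
#       *)          echo "stack is not COMPLETE, a PATCH update will be refused" ;;
#   esac
```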
>>>
>>>
>>> The complete output of `heat deployment-show
>>> 655c77fc-6a78-4cca-b4b7-a153a3f4ad52` is attached as a gzip archive.
>>>
>>>
>>> Thanks.
>>>
>>> Boris.
>>>
>>>
>>>
>>> _______________________________________________
>>> rdo-list mailing list
>>> rdo-list at redhat.com
>>> https://www.redhat.com/mailman/listinfo/rdo-list
>>>
>>> To unsubscribe: rdo-list-unsubscribe at redhat.com
>>>
>>
>> The failure occurred during the post-deployment, which means that the
>> initial deployment succeeded, but then the steps that are done to the
>> completed overcloud failed.
>>
>> This is most commonly attributable to network problems between the
>> Undercloud and the Overcloud Public API. The Undercloud needs to reach
>> the Public API in order to do some of the post-configuration steps. If
>> this API isn't reachable, you end up with the error you saw above.
>>
>> You can test this connectivity by pinging the Public API VIP from the
>> Undercloud. Starting with the failed deployment, run "neutron
>> port-list" against the Undercloud and look for the IP on the port
>> named "public_virtual_ip". You should be able to ping this address from
>> the Undercloud. If you can't reach that IP, then you need to check the
>> connectivity/routing between the Undercloud and the External network on
>> the Overcloud.
>>
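
The connectivity check described above could be scripted roughly as follows. This is a sketch under assumptions: `public_vip_from_port_list` is a made-up helper, and it assumes `neutron port-list` prints its usual table with the port name in the second column and a fixed_ips cell containing `"ip_address": "..."`.

```shell
# Hypothetical helper: read "neutron port-list" table output on stdin and
# print the IP address of the port named public_virtual_ip.
public_vip_from_port_list() {
    awk -F'|' '$3 ~ /public_virtual_ip/ {
        if (match($0, /"ip_address": "[^"]+"/)) {
            s = substr($0, RSTART, RLENGTH)
            gsub(/"ip_address": "|"/, "", s)   # keep only the address
            print s
        }
    }'
}

# Usage on the undercloud:
#   . ~/stackrc
#   vip=$(neutron port-list | public_vip_from_port_list)
#   ping -c 3 "$vip"
```

If the ping fails, the routing between the Undercloud and the External network is the next thing to check.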
>
> I should also mention common causes of this problem:
>
> * Incorrect value for ExternalInterfaceDefaultRoute in the network
> environment file.
> * Controllers do not have the default route on the External network in
> the NIC config templates (required for reachability from remote subnets).
> * Incorrect subnet mask on the ExternalNetCidr in the network environment.
> * Incorrect ExternalAllocationPools values in the network environment.
> * Incorrect Ethernet switch config for the Controllers.
>
>         The issue has been reproduced 4 times, on a daily basis
>         since 06/25/16, with exactly the same error at Step4
>         of overcloud-ControllerNodesPostDeployment.
>         In the meantime I cannot reproduce the error.
>         The config 3xNode HA Controller + 1xCompute works.
>         There was one more issue: 3xNode HA Controller + 2xCompute
>         failed immediately when overcloud-deploy.sh started, because
>         only 4 nodes could be introspected. I will test it tomorrow morning.
>
>         Thanks a lot.
>         Boris.
>
> --
> Dan Sneddon         |  Principal OpenStack Engineer
> dsneddon at redhat.com |  redhat.com/openstack
> 650.254.4025        |  dsneddon:irc   @dxs:twitter
>
>
>
>
>