[Rdo-list] Overcloud deploy stuck for a long time

Udi Kalifon ukalifon at redhat.com
Wed Oct 14 11:40:23 UTC 2015


My overcloud deployment also hangs for 4 hours and then fails. This is what
I got on the 1st run:

[stack at instack ~]$ openstack overcloud deploy --templates
Deploying templates in the directory
/usr/share/openstack-tripleo-heat-templates
ERROR: Authentication failed. Please try again with option
--include-password or export HEAT_INCLUDE_PASSWORD=1
Authentication required
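
A minimal sketch of the workaround the error message itself suggests (assuming
stackrc is already sourced; the deploy line is commented out since it needs a
live undercloud, and the echo is just a local sanity check):

```shell
# Export the flag the error message asks for, so Heat can re-authenticate
# after the Keystone token expires, then re-run the deploy.
export HEAT_INCLUDE_PASSWORD=1
# openstack overcloud deploy --templates   # re-run with the flag active (needs a live undercloud)
echo "HEAT_INCLUDE_PASSWORD=${HEAT_INCLUDE_PASSWORD}"
```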

I am assuming the authentication error is due to the expiration of the
token after 4 hours, and not because I forgot the rc file. I tried to run
the deployment again and it failed after another 4 hours with a different
error:

[stack at instack ~]$ openstack overcloud deploy --templates
Deploying templates in the directory
/usr/share/openstack-tripleo-heat-templates
Stack failed with status: resources.Controller: resources[0]:
ResourceInError: resources.Controller: Went to status ERROR due to
"Message: Exceeded maximum number of retries. Exceeded max scheduling
attempts 3 for instance 9eedda9e-f381-47d4-a883-0fe40db0eb5e. Last
exception: [u'Traceback (most recent call last):\n', u'  File
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1, Code:
500"
Heat Stack update failed.

The failed resources are:

heat resource-list -n 5 overcloud | egrep -v COMPLETE
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+--------------------------------------------------+
| resource_name | physical_resource_id                 | resource_type           | resource_status | updated_time        | stack_name                                       |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+--------------------------------------------------+
| Compute       | aee2604f-2580-44c9-bc38-45046970fd63 | OS::Heat::ResourceGroup | UPDATE_FAILED   | 2015-10-14T06:32:34 | overcloud                                        |
| 0             | 2199c1c6-60ca-42a4-927c-8bf0fb8763b7 | OS::TripleO::Compute    | UPDATE_FAILED   | 2015-10-14T06:32:36 | overcloud-Compute-dq426vplp2nu                   |
| Controller    | 2ae19a5f-f88c-4d8b-98ec-952657b70cd6 | OS::Heat::ResourceGroup | UPDATE_FAILED   | 2015-10-14T06:32:36 | overcloud                                        |
| 0             | 2fc3ed0c-da5c-45e4-a255-4b4a8ef58dd7 | OS::TripleO::Controller | UPDATE_FAILED   | 2015-10-14T06:32:38 | overcloud-Controller-ktbqsolaqm4u                |
| NovaCompute   | 7938bbe0-ab97-499f-8859-15f903e7c09b | OS::Nova::Server        | CREATE_FAILED   | 2015-10-14T06:32:55 | overcloud-Compute-dq426vplp2nu-0-4acm6pstctor    |
| Controller    | c1cd6b72-ec0d-4c13-b21c-10d0f6c45788 | OS::Nova::Server        | CREATE_FAILED   | 2015-10-14T06:32:58 | overcloud-Controller-ktbqsolaqm4u-0-d76rtersrtyt |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+--------------------------------------------------+
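
A quick way to pull the FAILED rows out of a listing like the one above into
"resource_name stack_name" pairs (a sketch using awk on two sample rows copied
from the table; not a TripleO tool):

```shell
# Two sample rows copied from the resource listing above.
sample='| NovaCompute | 7938bbe0-ab97-499f-8859-15f903e7c09b | OS::Nova::Server | CREATE_FAILED | 2015-10-14T06:32:55 | overcloud-Compute-dq426vplp2nu-0-4acm6pstctor |
| Controller | c1cd6b72-ec0d-4c13-b21c-10d0f6c45788 | OS::Nova::Server | CREATE_FAILED | 2015-10-14T06:32:58 | overcloud-Controller-ktbqsolaqm4u-0-d76rtersrtyt |'

# Split on "|": field 2 is resource_name, field 7 is stack_name;
# strip the padding spaces and print the pair for each FAILED row.
printf '%s\n' "$sample" |
  awk -F'|' '/FAILED/ { gsub(/ /, "", $2); gsub(/ /, "", $7); print $2, $7 }'
```

The same filter can be pointed at live `heat resource-list -n 5 overcloud`
output instead of the sample text.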


I was unable to run resource-show or deployment-show on the failed
resources; both commands kept complaining that the resources were not found.
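
One possible reason for the "not found" errors: the OS::Nova::Server rows live
in nested child stacks, so resource-show against the top-level overcloud stack
cannot see them; the stack_name column gives the child stack to query instead.
A sketch (names taken from the table above; the command is built as a string
here, since actually running it needs a live undercloud):

```shell
# resource-show takes <stack> <resource>; for a nested resource, pass the
# child stack from the stack_name column, not "overcloud".
child_stack="overcloud-Compute-dq426vplp2nu-0-4acm6pstctor"   # from the table above
resource="NovaCompute"
cmd="heat resource-show ${child_stack} ${resource}"
echo "$cmd"   # run this against the undercloud to inspect the failed server
```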

Thanks,
Udi.


On Wed, Oct 14, 2015 at 11:16 AM, Tzach Shefi <tshefi at redhat.com> wrote:

> Hi Sasha/Dan,
> Yep, that's the bug I opened yesterday about this.
>
> sshd and firewall rules look OK, based on the tests below:
> I can ssh into the virt host from my laptop as root, checking the
> 10.X.X.X net.
> I can also ssh from the instack VM to the virt host, checking the
> 192.168.122.X net.
>
> Unless I should check ssh with another user; if so, which one?
> I doubt an ssh user or firewall issue caused the problem, since the
> controller was installed successfully and it uses the same ssh-based
> virt power-on procedure.
>
> The deployment is still up and stuck; if anyone wants to take a look,
> contact me privately for access details.
>
> I will review/use the virt console, virt journal, and timeout tips on the
> next deployment.
>
> Thanks
> Tzach
>
>
> On Wed, Oct 14, 2015 at 5:07 AM, Sasha Chuzhoy <sasha at redhat.com> wrote:
>
>> I hit the same (or similar) issue on my BM environment, though I managed
>> to complete the 1+1 deployment on VMs successfully.
>> I see it's reported already:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1271289
>>
>> Ran a deployment with:   openstack overcloud deploy --templates --timeout
>> 90 --compute-scale 3 --control-scale 1
>> The deployment fails, and I see that "all minus one" overcloud nodes are
>> still in BUILD status.
>>
>> [stack at undercloud ~]$ nova list
>> +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
>> | ID                                   | Name                    | Status | Task State | Power State | Networks            |
>> +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
>> | b15f499e-79ed-46b2-b990-878dbe6310b1 | overcloud-controller-0  | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.23 |
>> | 4877d14a-e34e-406b-8005-dad3d79f5bab | overcloud-novacompute-0 | ACTIVE | -          | Running     | ctlplane=192.0.2.9  |
>> | 0fd1a7ed-367e-448e-8602-8564bf087e92 | overcloud-novacompute-1 | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.21 |
>> | 51630a7d-c140-47b9-a071-1f2fdb45f4b4 | overcloud-novacompute-2 | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.22 |
>> +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
>>
>>
>> Will try to investigate further tomorrow.
>>
>> Best regards,
>> Sasha Chuzhoy.
>>
>> ----- Original Message -----
>> > From: "Tzach Shefi" <tshefi at redhat.com>
>> > To: "Dan Sneddon" <dsneddon at redhat.com>
>> > Cc: rdo-list at redhat.com
>> > Sent: Tuesday, October 13, 2015 6:01:48 AM
>> > Subject: Re: [Rdo-list] Overcloud deploy stuck for a long time
>> >
>> > So I gave it a few more hours; in the heat resource list nothing is
>> > FAILED, only CREATE_COMPLETE and some INIT_COMPLETE.
>> >
>> > nova list:
>> > | 61aaed37-4993-4165-93a7-3c9bf6b10a21 | overcloud-controller-0  | ACTIVE | -        | Running | ctlplane=192.0.2.8 |
>> > | 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7 | overcloud-novacompute-0 | BUILD  | spawning | NOSTATE | ctlplane=192.0.2.9 |
>> >
>> >
>> > nova show 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7
>> > +--------------------------------------+--------------------------------------+
>> > | Property                             | Value                                |
>> > +--------------------------------------+--------------------------------------+
>> > | OS-DCF:diskConfig                    | MANUAL                               |
>> > | OS-EXT-AZ:availability_zone          | nova                                 |
>> > | OS-EXT-SRV-ATTR:host                 | instack.localdomain                  |
>> > | OS-EXT-SRV-ATTR:hypervisor_hostname  | 4626bf90-7f95-4bd7-8bee-5f5b0a0981c6 |
>> > | OS-EXT-SRV-ATTR:instance_name        | instance-00000002                    |
>> > | OS-EXT-STS:power_state               | 0                                    |
>> > | OS-EXT-STS:task_state                | spawning                             |
>> > | OS-EXT-STS:vm_state                  | building                             |
>> >
>> > Checking nova log this is what I see:
>> >
>> > nova-compute.log:{"nodes": [{"target_power_state": null, "links":
>> [{"href": "
>> > http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6 ",
>> > "rel": "self"}, {"href": "
>> > http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6 ",
>> "rel":
>> > "bookmark"}], "extra": {}, "last_error": " Failed to change power state
>> to
>> > 'power on'. Error: Failed to execute command via SSH : LC_ALL=C
>> > /usr/bin/virsh --connect qemu:///system start baremetalbrbm_1.",
>> > "updated_at": "2015-10-12T14:36:08+00:00", "maintenance_reason": null,
>> > "provision_state": "deploying", "clean_step": {}, "uuid":
>> > "4626bf90-7f95-4bd7-8bee-5f5b0a0981c6", "console_enabled": false,
>> > "target_provision_state": "active", "provision_updated_at":
>> > "2015-10-12T14:35:18+00:00", "power_state": "power off",
>> > "inspection_started_at": null, "inspection_finished_at": null,
>> > "maintenance": false, "driver": "pxe_ssh", "reservation": null,
>> > "properties": {"memory_mb": "4096", "cpu_arch": "x86_64", "local_gb":
>> "40",
>> > "cpus": "1", "capabilities": "boot_option:local"}, "instance_uuid":
>> > "7f9f4f52-3ee6-42d9-9275-ff88582dd6e7", "name": null, "driver_info":
>> > {"ssh_username": "root", "deploy_kernel":
>> > "94cc528d-d91f-4ca7-876e-2d8cbec66f1b", "deploy_ramdisk":
>> > "057d3b42-002a-4c24-bb3f-2032b8086108", "ssh_key_contents":
>> "-----BEGIN( I
>> > removed key..)END RSA PRIVATE KEY-----", "ssh_virt_type": "virsh",
>> > "ssh_address": "192.168.122.1"}, "created_at":
>> "2015-10-12T14:26:30+00:00",
>> > "ports": [{"href": "
>> >
>> http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports
>> ",
>> > "rel": "self"}, {"href": "
>> > http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports
>> ",
>> > "rel": "bookmark"}], "driver_internal_info": {"clean_steps": null,
>> > "root_uuid_or_disk_id": "9ff90423-9d18-4dd1-ae96-a4466b52d9d9",
>> > "is_whole_disk_image": false}, "instance_info": {"ramdisk":
>> > "82639516-289d-4603-bf0e-8131fa75ec46", "kernel":
>> > "665ffcb0-2afe-4e04-8910-45b92826e328", "root_gb": "40", "display_name":
>> > "overcloud-novacompute-0", "image_source":
>> > "d99f460e-c6d9-4803-99e4-51347413f348", "capabilities":
>> "{\"boot_option\":
>> > \"local\"}", "memory_mb": "4096", "vcpus": "1", "deploy_key":
>> > "BI0FRWDTD4VGHII9JK2BYDDFR8WB1WUG", "local_gb": "40", "configdrive":
>> >
>> "H4sICGDEG1YC/3RtcHpwcWlpZQDt3WuT29iZ2HH02Bl7Fe/G5UxSqS3vLtyesaSl2CR4p1zyhk2Ct+ateScdVxcIgiR4A5sAr95xxa/iVOUz7EfJx8m7rXyE5IDslro1mpbGox15Zv6/lrpJ4AAHN/LBwXMIShIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADhJpvx+5UQq5EqNtvzldGs+MIfewJeNv53f/7n354F6xT/3v/TjH0v/chz0L5+8Gv2f3V+n0s+Pz34u/dj982PJfvSTvxFVfXQ7vfyBlRfGvOZo+kQuWWtNVgJn/jO/d6kHzvrGWlHOjGn0TDfmjmXL30kZtZSrlXPFREaVxQM5Hon4fdl0TU7nCmqtU6urRTlZVRP1clV+knwqK/F4UFbPOuVGKZNKFNTbgVFvwO+PyPmzipqo1solX/6slszmCuKozBzKuKPdMlE5ma
>> >
>> >
>> > Any ideas on how to resolve a stuck spawning compute node? It hasn't
>> > changed for a few hours now.
>> >
>> > Tzach
>> >
>> >
>> > On Mon, Oct 12, 2015 at 11:25 PM, Dan Sneddon < dsneddon at redhat.com >
>> wrote:
>> >
>> >
>> >
>> > On 10/12/2015 08:10 AM, Tzach Shefi wrote:
>> > > Hi,
>> > >
>> > > The server is running CentOS 7.1; the VM running the undercloud got
>> > > to the overcloud deploy stage.
>> > > It looks like it's stuck; nothing has advanced for a while.
>> > > Any ideas what to check?
>> > >
>> > > [stack at instack ~]$ openstack overcloud deploy --templates
>> > > Deploying templates in the directory
>> > > /usr/share/openstack-tripleo-heat-templates
>> > > [91665.696658] device vnet2 entered promiscuous mode
>> > > [91665.781346] device vnet3 entered promiscuous mode
>> > > [91675.260324] kvm [71183]: vcpu0 disabled perfctr wrmsr: 0xc1 data
>> 0xffff
>> > > [91675.291232] kvm [71200]: vcpu0 disabled perfctr wrmsr: 0xc1 data
>> 0xffff
>> > > [91767.799404] kvm: zapping shadow pages for mmio generation
>> wraparound
>> > > [91767.880480] kvm: zapping shadow pages for mmio generation
>> wraparound
>> > > [91768.957761] device vnet2 left promiscuous mode
>> > > [91769.799446] device vnet3 left promiscuous mode
>> > > [91771.223273] device vnet3 entered promiscuous mode
>> > > [91771.232996] device vnet2 entered promiscuous mode
>> > > [91773.733967] kvm [72245]: vcpu0 disabled perfctr wrmsr: 0xc1 data
>> 0xffff
>> > > [91801.270510] device vnet2 left promiscuous mode
>> > >
>> > >
>> > > Thanks
>> > > Tzach
>> > >
>> > >
>> > > _______________________________________________
>> > > Rdo-list mailing list
>> > > Rdo-list at redhat.com
>> > > https://www.redhat.com/mailman/listinfo/rdo-list
>> > >
>> > > To unsubscribe: rdo-list-unsubscribe at redhat.com
>> > >
>> >
>> > You're going to need a more complete command line than "openstack
>> > overcloud deploy --templates". For instance, if you are using VMs for
>> > your overcloud nodes, you will need to include "--libvirt-type qemu".
>> > There are probably a couple of other parameters that you will need.
>> >
>> > You can watch the deployment using this command, which will show you
>> > the progress:
>> >
>> > watch "heat resource-list -n 5 overcloud | grep -v COMPLETE"
>> >
>> > You can also explore which resources have failed:
>> >
>> > heat resource-list [-n 5] overcloud | grep FAILED
>> >
>> > And then look more closely at the failed resources:
>> >
>> > heat resource-show overcloud <resource>
>> >
>> > There are some more complete troubleshooting instructions here:
>> >
>> >
>> http://docs.openstack.org/developer/tripleo-docs/troubleshooting/troubleshooting-overcloud.html
>> >
>> > --
>> > Dan Sneddon | Principal OpenStack Engineer
>> > dsneddon at redhat.com | redhat.com/openstack
>> > 650.254.4025 | dsneddon:irc @dxs:twitter
>> >
>> >
>> >
>> >
>> > --
>> > Tzach Shefi
>> > Quality Engineer, Redhat OSP
>> > +972-54-4701080
>> >
>>
>
>
>
> --
> *Tzach Shefi*
> Quality Engineer, Redhat OSP
> +972-54-4701080
>
>