[Rdo-list] Overcloud deploy stuck for a long time

Tue Oct 13 15:48:38 UTC 2015

On 10/13/2015 03:01 AM, Tzach Shefi wrote:
> So I gave it a few more hours. In the heat resource list nothing has
> failed; there are only CREATE_COMPLETE and some INIT_COMPLETE entries.
> 
> nova list
> | 61aaed37-4993-4165-93a7-3c9bf6b10a21 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.0.2.8 |
> | 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7 | overcloud-novacompute-0 | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.9 |
> 
> 
> nova show 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7
> +--------------------------------------+----------------------------------------------------------+
> | Property                             | Value                                                    |
> +--------------------------------------+----------------------------------------------------------+
> | OS-DCF:diskConfig                    | MANUAL                                                   |
> | OS-EXT-AZ:availability_zone          | nova                                                     |
> | OS-EXT-SRV-ATTR:host                 | instack.localdomain                                      |
> | OS-EXT-SRV-ATTR:hypervisor_hostname  | 4626bf90-7f95-4bd7-8bee-5f5b0a0981c6                     |
> | OS-EXT-SRV-ATTR:instance_name        | instance-00000002                                        |
> | OS-EXT-STS:power_state               | 0                                                        |
> | OS-EXT-STS:task_state                | spawning                                                 |
> | OS-EXT-STS:vm_state                  | building                                                 |
> 
> Checking the nova log, this is what I see:
> 
> nova-compute.log:{"nodes": [{"target_power_state": null, "links":
> [{"href":
> "http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
> "rel": "self"}, {"href":
> "http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
> "rel": "bookmark"}], "extra": {}, "last_error": "*Failed to change
> power state to 'power on'. Error: Failed to execute command via SSH*:
> LC_ALL=C /usr/bin/virsh --connect qemu:///system start
> baremetalbrbm_1.", "updated_at": "2015-10-12T14:36:08+00:00",
> "maintenance_reason": null, "provision_state": "deploying",
> "clean_step": {}, "uuid": "4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
> "console_enabled": false, "target_provision_state": "active",
> "provision_updated_at": "2015-10-12T14:35:18+00:00", "power_state":
> "power off", "inspection_started_at": null, "inspection_finished_at":
> null, "maintenance": false, "driver": "pxe_ssh", "reservation": null,
> "properties": {"memory_mb": "4096", "cpu_arch": "x86_64", "local_gb":
> "40", "cpus": "1", "capabilities": "boot_option:local"},
> "instance_uuid": "7f9f4f52-3ee6-42d9-9275-ff88582dd6e7", "name": null,
> "driver_info": {"ssh_username": "root", "deploy_kernel":
> "94cc528d-d91f-4ca7-876e-2d8cbec66f1b", "deploy_ramdisk":
> "057d3b42-002a-4c24-bb3f-2032b8086108", "ssh_key_contents":
> "-----BEGIN( I removed key..)END RSA PRIVATE KEY-----",
> "ssh_virt_type": "virsh", "ssh_address": "192.168.122.1"},
> "created_at": "2015-10-12T14:26:30+00:00", "ports": [{"href":
> "http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports",
> "rel": "self"}, {"href":
> "http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports",
> "rel": "bookmark"}], "driver_internal_info": {"clean_steps": null,
> "root_uuid_or_disk_id": "9ff90423-9d18-4dd1-ae96-a4466b52d9d9",
> "is_whole_disk_image": false}, "instance_info": {"ramdisk":
> "82639516-289d-4603-bf0e-8131fa75ec46", "kernel":
> "665ffcb0-2afe-4e04-8910-45b92826e328", "root_gb": "40",
> "display_name": "overcloud-novacompute-0", "image_source":
> "d99f460e-c6d9-4803-99e4-51347413f348", "capabilities":
> "{\"boot_option\": \"local\"}", "memory_mb": "4096", "vcpus": "1",
> "deploy_key": "BI0FRWDTD4VGHII9JK2BYDDFR8WB1WUG", "local_gb": "40",
> "configdrive":
> "H4sICGDEG1YC/3RtcHpwcWlpZQDt3WuT29iZ2HH02Bl7Fe/G5UxSqS3vLtyesaSl2CR4p1zyhk2Ct+ateScdVxcIgiR4A5sAr95xxa/iVOUz7EfJx8m7rXyE5IDslro1mpbGox15Zv6/lrpJ4AAHN/LBwXMIShIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADhJpvx+5UQq5EqNtvzldGs+MIfewJeNv53f/7n354F6xT/3v/TjH0v/chz0L5+8Gv2f3V+n0s+Pz34u/dj982PJfvSTvxFVfXQ7vfyBlRfGvOZo+kQuWWtNVgJn/jO/d6kHzvrGWlHOjGn0TDfmjmXL30kZtZSrlXPFREaVxQM5Hon4fdl0TU7nCmqtU6urRTlZVRP1clV+knwqK/F4UFbPOuVGKZNKFNTbgVFvwO+PyPmzipqo1solX/6slszmCuKozBzKuKPdMlE5ma
> 
> 
> Any ideas on how to resolve a compute node that is stuck spawning? It
> hasn't changed for a few hours now.
> 
> Tzach  
> 
> 
> On Mon, Oct 12, 2015 at 11:25 PM, Dan Sneddon <dsneddon at redhat.com> wrote:
> 
>     On 10/12/2015 08:10 AM, Tzach Shefi wrote:
>     > Hi,
>     >
>     > The server is running CentOS 7.1; the VM running the undercloud
>     > got up to the overcloud deploy stage.
>     > It looks like it's stuck; nothing has advanced for a while.
>     > Any ideas on what to check?
>     >
>     > [stack@instack ~]$ openstack overcloud deploy --templates
>     > Deploying templates in the directory
>     > /usr/share/openstack-tripleo-heat-templates
>     > [91665.696658] device vnet2 entered promiscuous mode
>     > [91665.781346] device vnet3 entered promiscuous mode
>     > [91675.260324] kvm [71183]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>     > [91675.291232] kvm [71200]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>     > [91767.799404] kvm: zapping shadow pages for mmio generation wraparound
>     > [91767.880480] kvm: zapping shadow pages for mmio generation wraparound
>     > [91768.957761] device vnet2 left promiscuous mode
>     > [91769.799446] device vnet3 left promiscuous mode
>     > [91771.223273] device vnet3 entered promiscuous mode
>     > [91771.232996] device vnet2 entered promiscuous mode
>     > [91773.733967] kvm [72245]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>     > [91801.270510] device vnet2 left promiscuous mode
>     >
>     >
>     > Thanks
>     > Tzach
>     >
>     >
>     > _______________________________________________
>     > Rdo-list mailing list
>     > Rdo-list at redhat.com <mailto:Rdo-list at redhat.com>
>     > https://www.redhat.com/mailman/listinfo/rdo-list
>     >
>     > To unsubscribe: rdo-list-unsubscribe at redhat.com
>     <mailto:rdo-list-unsubscribe at redhat.com>
>     >
> 
>     You're going to need a more complete command line than "openstack
>     overcloud deploy --templates". For instance, if you are using VMs for
>     your overcloud nodes, you will need to include "--libvirt-type qemu".
>     There are probably a couple of other parameters that you will need.
> 
>     You can watch the deployment using this command, which will show you
>     the progress:
> 
>     watch "heat resource-list -n 5 | grep -v COMPLETE"
> 
>     You can also explore which resources have failed:
> 
>     heat resource-list [-n 5]| grep FAILED
> 
>     And then look more closely at the failed resources:
> 
>     heat resource-show overcloud <resource>
> 
>     There are some more complete troubleshooting instructions here:
> 
>     http://docs.openstack.org/developer/tripleo-docs/troubleshooting/troubleshooting-overcloud.html
> 
>     --
>     Dan Sneddon         |  Principal OpenStack Engineer
>     dsneddon at redhat.com |  redhat.com/openstack
>     650.254.4025        |  dsneddon:irc   @dxs:twitter
> 
> 
> 
> -- 
> *Tzach Shefi*
> Quality Engineer, Redhat OSP
> +972-54-4701080 <callto:+972-52-4534729>

The deployment does look stuck to me. The problem appears to be a
failure to set the power state on one of the VM nodes through libvirt.

For virtual nodes, the SSH driver (pxe_ssh) connects over SSH from the
Undercloud VM to the VM host system and issues libvirt (virsh) commands
to start and stop VMs. That process failed when powering on one of your
nodes, and it doesn't look like the deployment is recovering from that
error.
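
If you want to pull the error out of the node list that nova-compute
logged rather than reading it by eye, something like the following may
help. It is only a minimal Python sketch: it assumes you have the JSON
payload available as a string, the function name nodes_with_errors is
made up for illustration, and the only field names it relies on (nodes,
uuid, provision_state, power_state, last_error) are the ones visible in
your log excerpt.

```python
import json


def nodes_with_errors(payload):
    """Return (uuid, provision_state, power_state, last_error) tuples for
    every node whose last_error field is set.

    `payload` is JSON text shaped like the node list in the nova-compute
    log: {"nodes": [{...}, ...]}.
    """
    data = json.loads(payload)
    found = []
    for node in data.get("nodes", []):
        error = node.get("last_error")
        if error:  # null or empty means no error is recorded for the node
            found.append((
                node.get("uuid"),
                node.get("provision_state"),
                node.get("power_state"),
                error,
            ))
    return found


if __name__ == "__main__":
    # Trimmed-down sample built from values in the log excerpt above.
    sample = json.dumps({"nodes": [{
        "uuid": "4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
        "provision_state": "deploying",
        "power_state": "power off",
        "last_error": "Failed to change power state to 'power on'. "
                      "Error: Failed to execute command via SSH: "
                      "LC_ALL=C /usr/bin/virsh --connect qemu:///system "
                      "start baremetalbrbm_1.",
    }]})
    for uuid, provision, power, error in nodes_with_errors(sample):
        print(uuid, provision, power)
        print("  ", error)
```

A node that reports provision_state "deploying" together with power_state
"power off" and a non-empty last_error, as yours does, is the one to
look at first.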

I'm not quite sure why that is happening, but I can think of a few
possible reasons:

* The SSH daemon is not running on the virt host.
* The virt host was not able to respond to the request; perhaps it was
overloaded.
* A firewall is blocking SSH connections from the Instack VM to the
virt host.
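
The first and third possibilities can be narrowed down from the
Undercloud VM by checking whether anything accepts TCP connections on
the virt host's SSH port. This is a minimal Python sketch, assuming the
virt host is the ssh_address from the node's driver_info (192.168.122.1
in your log) and that sshd listens on the default port 22; the function
name is hypothetical. It only shows that the port is reachable, not
that the key is accepted or that virsh works once logged in.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused (nothing listening), timeouts (for
        # example a firewall silently dropping packets) and unreachable hosts.
        return False


if __name__ == "__main__":
    virt_host = "192.168.122.1"  # ssh_address from the log; adjust if yours differs
    state = "reachable" if ssh_port_reachable(virt_host) else "NOT reachable"
    print("SSH port on %s is %s" % (virt_host, state))
```

If the port is reachable, the overloaded-host explanation becomes more
likely, and running the virsh command from the error message by hand on
the virt host would be the next thing to try.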

One tip for the next deployment: you can set the deployment timeout.
That way, if it does hang, you don't have to wait 4 hours for it to
fail. Conservatively, you could pass --timeout 90 to set the timeout to
90 minutes. A 2-node deployment will definitely either deploy or fail in
that amount of time (probably much less, but I wouldn't want you to cut
off a deployment that might succeed if given a little more time).
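
As a rough sketch of what that looks like if you launch the deploy from
a script: the flags --templates, --libvirt-type qemu and --timeout are
the ones mentioned in this thread, while the Python wrapper and its
function names are only an assumption about how you might drive it. You
will probably need other parameters for your environment as well.

```python
import subprocess


def build_deploy_command(timeout_minutes=90, libvirt_type="qemu"):
    """Build the overcloud deploy command line as an argument list."""
    return [
        "openstack", "overcloud", "deploy",
        "--templates",
        "--libvirt-type", libvirt_type,     # for overcloud nodes that are VMs
        "--timeout", str(timeout_minutes),  # timeout in minutes
    ]


def run_deploy(timeout_minutes=90):
    """Run the deploy and return its exit code.

    Meant to be run on the undercloud as the stack user.
    """
    return subprocess.call(build_deploy_command(timeout_minutes))


if __name__ == "__main__":
    # Print rather than run, so the command can be reviewed first.
    print(" ".join(build_deploy_command()))
```

Building the command as a list keeps each flag and its value together
and avoids shell quoting problems if you add more parameters later.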

-- 
Dan Sneddon         |  Principal OpenStack Engineer
dsneddon at redhat.com |  redhat.com/openstack
650.254.4025        |  dsneddon:irc   @dxs:twitter