[Rdo-list] Overcloud deploy stuck for a long time

Marius Cornea mcornea at redhat.com
Tue Oct 13 17:56:38 UTC 2015



----- Original Message -----
> From: "Dan Sneddon" <dsneddon at redhat.com>
> To: "Tzach Shefi" <tshefi at redhat.com>
> Cc: rdo-list at redhat.com
> Sent: Tuesday, October 13, 2015 5:48:38 PM
> Subject: Re: [Rdo-list] Overcloud deploy stuck for a long time
> 
> On 10/13/2015 03:01 AM, Tzach Shefi wrote:
> > So I gave it a few more hours; in the heat resource list nothing has failed,
> > only create_complete and some init_complete.
> > 
> > nova list
> > | 61aaed37-4993-4165-93a7-3c9bf6b10a21 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.0.2.8 |
> > | 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7 | overcloud-novacompute-0 | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.9 |
> > 
> > 
> > nova show 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7
> > +--------------------------------------+--------------------------------------+
> > | Property                             | Value                                |
> > +--------------------------------------+--------------------------------------+
> > | OS-DCF:diskConfig                    | MANUAL                               |
> > | OS-EXT-AZ:availability_zone          | nova                                 |
> > | OS-EXT-SRV-ATTR:host                 | instack.localdomain                  |
> > | OS-EXT-SRV-ATTR:hypervisor_hostname  | 4626bf90-7f95-4bd7-8bee-5f5b0a0981c6 |
> > | OS-EXT-SRV-ATTR:instance_name        | instance-00000002                    |
> > | OS-EXT-STS:power_state               | 0                                    |
> > | OS-EXT-STS:task_state                | spawning                             |
> > | OS-EXT-STS:vm_state                  | building                             |
> > 
> > Checking nova log this is what I see:
> > 
> > nova-compute.log:{"nodes": [{"target_power_state": null, "links":
> > [{"href":
> > "http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
> > "rel": "self"}, {"href":
> > "http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
> > "rel": "bookmark"}], "extra": {}, "last_error": "*Failed to change
> > power state to 'power on'. Error: Failed to execute command via SSH*:
> > LC_ALL=C /usr/bin/virsh --connect qemu:///system start
> > baremetalbrbm_1.", "updated_at": "2015-10-12T14:36:08+00:00",
> > "maintenance_reason": null, "provision_state": "deploying",
> > "clean_step": {}, "uuid": "4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
> > "console_enabled": false, "target_provision_state": "active",
> > "provision_updated_at": "2015-10-12T14:35:18+00:00", "power_state":
> > "power off", "inspection_started_at": null, "inspection_finished_at":
> > null, "maintenance": false, "driver": "pxe_ssh", "reservation": null,
> > "properties": {"memory_mb": "4096", "cpu_arch": "x86_64", "local_gb":
> > "40", "cpus": "1", "capabilities": "boot_option:local"},
> > "instance_uuid": "7f9f4f52-3ee6-42d9-9275-ff88582dd6e7", "name": null,
> > "driver_info": {"ssh_username": "root", "deploy_kernel":
> > "94cc528d-d91f-4ca7-876e-2d8cbec66f1b", "deploy_ramdisk":
> > "057d3b42-002a-4c24-bb3f-2032b8086108", "ssh_key_contents":
> > "-----BEGIN( I removed key..)END RSA PRIVATE KEY-----",
> > "ssh_virt_type": "virsh", "ssh_address": "192.168.122.1"},
> > "created_at": "2015-10-12T14:26:30+00:00", "ports": [{"href":
> > "http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports",
> > "rel": "self"}, {"href":
> > "http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports",
> > "rel": "bookmark"}], "driver_internal_info": {"clean_steps": null,
> > "root_uuid_or_disk_id": "9ff90423-9d18-4dd1-ae96-a4466b52d9d9",
> > "is_whole_disk_image": false}, "instance_info": {"ramdisk":
> > "82639516-289d-4603-bf0e-8131fa75ec46", "kernel":
> > "665ffcb0-2afe-4e04-8910-45b92826e328", "root_gb": "40",
> > "display_name": "overcloud-novacompute-0", "image_source":
> > "d99f460e-c6d9-4803-99e4-51347413f348", "capabilities":
> > "{\"boot_option\": \"local\"}", "memory_mb": "4096", "vcpus": "1",
> > "deploy_key": "BI0FRWDTD4VGHII9JK2BYDDFR8WB1WUG", "local_gb": "40",
> > "configdrive":
> > "H4sICGDEG1YC/3RtcHpwcWlpZQ... [base64 configdrive contents truncated]
> > 
> > 
> > Any ideas on how to resolve a stuck, spawning compute node? It has been
> > stuck, unchanged, for a few hours now.
> > 
> > Tzach
> > 
> > 
> > On Mon, Oct 12, 2015 at 11:25 PM, Dan Sneddon <dsneddon at redhat.com> wrote:
> > 
> >     On 10/12/2015 08:10 AM, Tzach Shefi wrote:
> >     > Hi,
> >     >
> >     > The server is running CentOS 7.1; the VM running the undercloud got up
> >     > to the overcloud deploy stage.
> >     > It looks like it's stuck, with nothing advancing for a while.
> >     > Any ideas what to check?
> >     >
> >     > [stack at instack ~]$ openstack overcloud deploy --templates
> >     > Deploying templates in the directory
> >     > /usr/share/openstack-tripleo-heat-templates
> >     > [91665.696658] device vnet2 entered promiscuous mode
> >     > [91665.781346] device vnet3 entered promiscuous mode
> >     > [91675.260324] kvm [71183]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> >     > [91675.291232] kvm [71200]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> >     > [91767.799404] kvm: zapping shadow pages for mmio generation wraparound
> >     > [91767.880480] kvm: zapping shadow pages for mmio generation wraparound
> >     > [91768.957761] device vnet2 left promiscuous mode
> >     > [91769.799446] device vnet3 left promiscuous mode
> >     > [91771.223273] device vnet3 entered promiscuous mode
> >     > [91771.232996] device vnet2 entered promiscuous mode
> >     > [91773.733967] kvm [72245]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> >     > [91801.270510] device vnet2 left promiscuous mode
> >     >
> >     >
> >     > Thanks
> >     > Tzach
> >     >
> >     >
> >     > _______________________________________________
> >     > Rdo-list mailing list
> >     > Rdo-list at redhat.com
> >     > https://www.redhat.com/mailman/listinfo/rdo-list
> >     >
> >     > To unsubscribe: rdo-list-unsubscribe at redhat.com
> >     >
> > 
> >     You're going to need a more complete command line than "openstack
> >     overcloud deploy --templates". For instance, if you are using VMs for
> >     your overcloud nodes, you will need to include "--libvirt-type qemu".
> >     There are probably a couple of other parameters that you will need.
> > 
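For reference, a minimal command line for a virt environment along the lines Dan
describes would look something like this (just a sketch; real deployments usually
add further options and -e environment files on top):

    openstack overcloud deploy --templates --libvirt-type qemu
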
> >     You can watch the deployment using this command, which will show you
> >     the progress:
> > 
> >     watch "heat resource-list -n 5 overcloud | grep -v COMPLETE"
> > 
> >     You can also explore which resources have failed:
> > 
> >     heat resource-list [-n 5] overcloud | grep FAILED
> > 
> >     And then look more closely at the failed resources:
> > 
> >     heat resource-show overcloud <resource>
> > 
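Putting those commands together, a typical drill-down looks roughly like this
(the "Compute" resource name is only an example; use whatever actually shows up
as FAILED in your listing):

    heat resource-list -n 5 overcloud | grep -i failed
    heat resource-show overcloud Compute
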
> >     There are some more complete troubleshooting instructions here:
> > 
> >     http://docs.openstack.org/developer/tripleo-docs/troubleshooting/troubleshooting-overcloud.html
> > 
> >     --
> >     Dan Sneddon         |  Principal OpenStack Engineer
> >     dsneddon at redhat.com |  redhat.com/openstack
> >     650.254.4025        |  dsneddon:irc   @dxs:twitter
> > 
> >     _______________________________________________
> >     Rdo-list mailing list
> >     Rdo-list at redhat.com
> >     https://www.redhat.com/mailman/listinfo/rdo-list
> > 
> >     To unsubscribe: rdo-list-unsubscribe at redhat.com
> > 
> > 
> > 
> > 
> > --
> > *Tzach Shefi*
> > Quality Engineer, Redhat OSP
> > +972-54-4701080
> 
> The deployment looks like it is stuck to me. The problem, though,
> appears to be an inability to set the power state on one of the VM
> nodes through libvirt.
> 
> What the SSH driver does for virt is to SSH from the Undercloud VM to
> the VM host system, and issue libvirt commands to start/stop VMs. That
> process failed when setting the power state of one of your nodes, and
> it doesn't look like the deployment is recovering from that error.
> 
> I'm not quite sure why that is happening, but I can think of a few
> possible reasons:
> 
> * SSH daemon not running on the virt host
> * The virt host was not able to respond to the request, perhaps it was
> overloaded?
> * Firewall blocking SSH connections from the Instack VM to the virt host?
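
To narrow down the possibilities above, it helps to re-run by hand what the
pxe_ssh driver tried, using the ssh_address/ssh_username from the node's
driver_info (root@192.168.122.1 in the log you pasted). A rough sketch:

    # from the undercloud (instack) VM: is the virt host reachable over SSH at all?
    ssh root@192.168.122.1 true

    # if so, repeat the exact command that failed in the Ironic log
    ssh root@192.168.122.1 "LC_ALL=C /usr/bin/virsh --connect qemu:///system start baremetalbrbm_1"

    # on the virt host itself: is sshd running, and is port 22 open?
    systemctl status sshd
    iptables -L -n | grep -w 22

If the manual virsh start also fails, the error it prints (domain already
running, not enough memory, etc.) usually points at the real cause.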
> 
> One tip for the next deployment: You can set the timeout. That way, if
> it does get hung up you don't have to wait 4 hours for it to fail.
> Conservatively, you could set --timeout 90 to set the timeout to 90
> minutes. A 2-node deployment will definitely either deploy or fail in
> that amount of time (probably much less, but I wouldn't want you to cut
> off a deployment that might be successful if given a little more time).
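
Putting Dan's suggestions together (the --libvirt-type qemu flag from earlier in
the thread plus the timeout here), the next attempt could look roughly like:

    openstack overcloud deploy --templates --libvirt-type qemu --timeout 90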

For virt environments you'll also find it useful to run virt-manager and connect to the virt host, so you can see whether the VMs are running and watch their consoles during introspection/deploy. Also watch the libvirtd logs on the virt host (journalctl -fl -u libvirtd).
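
For example, on the virt host:

    journalctl -fl -u libvirtd        # follow libvirtd while the deploy runs
    virsh list --all                  # are the overcloud VMs actually running?
    virsh console baremetalbrbm_1     # watch the stuck node's console

(baremetalbrbm_1 is the domain name from the Ironic log above; substitute
whatever domain names exist on your virt host.)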

> --
> Dan Sneddon         |  Principal OpenStack Engineer
> dsneddon at redhat.com |  redhat.com/openstack
> 650.254.4025        |  dsneddon:irc   @dxs:twitter
> 
> _______________________________________________
> Rdo-list mailing list
> Rdo-list at redhat.com
> https://www.redhat.com/mailman/listinfo/rdo-list
> 
> To unsubscribe: rdo-list-unsubscribe at redhat.com
> 



