Re: [Rdo-list] Overcloud deploy stuck for a long time

Tuesday, 13 October 2015

On 10/13/2015 03:01 AM, Tzach Shefi wrote:
...
 I gave it a few more hours. In the heat resource list nothing has failed;
 there are only create_complete and some init_complete entries.

 nova list
 | 61aaed37-4993-4165-93a7-3c9bf6b10a21 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.0.2.8 |
 | 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7 | overcloud-novacompute-0 | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.9 |

 nova show 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7

 +--------------------------------------+----------------------------------------------------------+
 | Property                             | Value                                                    |
 +--------------------------------------+----------------------------------------------------------+
 | OS-DCF:diskConfig                    | MANUAL                                                   |
 | OS-EXT-AZ:availability_zone          | nova                                                     |
 | OS-EXT-SRV-ATTR:host                 | instack.localdomain                                      |
 | OS-EXT-SRV-ATTR:hypervisor_hostname  | 4626bf90-7f95-4bd7-8bee-5f5b0a0981c6                     |
 | OS-EXT-SRV-ATTR:instance_name        | instance-00000002                                        |
 | OS-EXT-STS:power_state               | 0                                                        |
 | OS-EXT-STS:task_state                | spawning                                                 |
 | OS-EXT-STS:vm_state                  | building                                                 |

 Checking the nova log, this is what I see:

 nova-compute.log:{"nodes": [{"target_power_state": null, "links":
 [{"href":
 "http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
 "rel": "self"}, {"href":
 "http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
 "rel": "bookmark"}], "extra": {}, "last_error": "*Failed to change
 power state to 'power on'. Error: Failed to execute command via SSH*:
 LC_ALL=C /usr/bin/virsh --connect qemu:///system start
 baremetalbrbm_1.", "updated_at": "2015-10-12T14:36:08+00:00",
 "maintenance_reason": null, "provision_state": "deploying",
 "clean_step": {}, "uuid": "4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
 "console_enabled": false, "target_provision_state": "active",
 "provision_updated_at": "2015-10-12T14:35:18+00:00", "power_state":
 "power off", "inspection_started_at": null, "inspection_finished_at":
 null, "maintenance": false, "driver": "pxe_ssh", "reservation": null,
 "properties": {"memory_mb": "4096", "cpu_arch": "x86_64", "local_gb":
 "40", "cpus": "1", "capabilities": "boot_option:local"},
 "instance_uuid": "7f9f4f52-3ee6-42d9-9275-ff88582dd6e7", "name": null,
 "driver_info": {"ssh_username": "root", "deploy_kernel":
 "94cc528d-d91f-4ca7-876e-2d8cbec66f1b", "deploy_ramdisk":
 "057d3b42-002a-4c24-bb3f-2032b8086108", "ssh_key_contents":
 "-----BEGIN( I removed key..)END RSA PRIVATE KEY-----",
 "ssh_virt_type": "virsh", "ssh_address": "192.168.122.1"},
 "created_at": "2015-10-12T14:26:30+00:00", "ports": [{"href":
 "http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports",
 "rel": "self"}, {"href":
 "http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports",
 "rel": "bookmark"}], "driver_internal_info": {"clean_steps": null,
 "root_uuid_or_disk_id": "9ff90423-9d18-4dd1-ae96-a4466b52d9d9",
 "is_whole_disk_image": false}, "instance_info": {"ramdisk":
 "82639516-289d-4603-bf0e-8131fa75ec46", "kernel":
 "665ffcb0-2afe-4e04-8910-45b92826e328", "root_gb": "40",
 "display_name": "overcloud-novacompute-0", "image_source":
 "d99f460e-c6d9-4803-99e4-51347413f348", "capabilities":
 "{\"boot_option\": \"local\"}", "memory_mb": "4096", "vcpus": "1",
 "deploy_key": "BI0FRWDTD4VGHII9JK2BYDDFR8WB1WUG", "local_gb": "40",
 "configdrive":
"H4sICGDEG1YC/3RtcHpwcWlpZQDt3WuT29iZ2HH02Bl7Fe/G5UxSqS3vLtyesaSl2CR4p1zyhk2Ct+ateScdVxcIgiR4A5sAr95xxa/iVOUz7EfJx8m7rXyE5IDslro1mpbGox15Zv6/lrpJ4AAHN/LBwXMIShIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADhJpvx+5UQq5EqNtvzldGs+MIfewJeNv53f/7n354F6xT/3v/TjH0v/chz0L5+8Gv2f3V+n0s+Pz34u/dj982PJfvSTvxFVfXQ7vfyBlRfGvOZo+kQuWWtNVgJn/jO/d6kHzvrGWlHOjGn0TDfmjmXL30kZtZSrlXPFREaVxQM5Hon4fdl0TU7nCmqtU6urRTlZVRP1clV+knwqK/F4UFbPOuVGKZNKFNTbgVFvwO+PyPmzipqo1solX/6slszmCuKozBzKuKPdMlE5ma

 Any ideas on how to resolve a compute node stuck in spawning? Its state
 hasn't changed for a few hours now.

 Tzach

 On Mon, Oct 12, 2015 at 11:25 PM, Dan Sneddon <dsneddon(a)redhat.com> wrote:

     On 10/12/2015 08:10 AM, Tzach Shefi wrote:
     > Hi,
     >
     > The server is running CentOS 7.1; the undercloud VM got as far as
     > the overcloud deploy stage.
     > It looks like it's stuck, with nothing advancing for a while.
     > Any ideas on what to check?
     >
     > [stack@instack ~]$ openstack overcloud deploy --templates
     > Deploying templates in the directory
     > /usr/share/openstack-tripleo-heat-templates
     > [91665.696658] device vnet2 entered promiscuous mode
     > [91665.781346] device vnet3 entered promiscuous mode
     > [91675.260324] kvm [71183]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
     > [91675.291232] kvm [71200]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
     > [91767.799404] kvm: zapping shadow pages for mmio generation wraparound
     > [91767.880480] kvm: zapping shadow pages for mmio generation wraparound
     > [91768.957761] device vnet2 left promiscuous mode
     > [91769.799446] device vnet3 left promiscuous mode
     > [91771.223273] device vnet3 entered promiscuous mode
     > [91771.232996] device vnet2 entered promiscuous mode
     > [91773.733967] kvm [72245]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
     > [91801.270510] device vnet2 left promiscuous mode
     >
     >
     > Thanks
     > Tzach
     >
     >
     > _______________________________________________
     > Rdo-list mailing list
     > Rdo-list(a)redhat.com <mailto:Rdo-list@redhat.com>
     > https://www.redhat.com/mailman/listinfo/rdo-list
     >
     > To unsubscribe: rdo-list-unsubscribe(a)redhat.com
     <mailto:rdo-list-unsubscribe@redhat.com>
     >

     You're going to need a more complete command line than "openstack
     overcloud deploy --templates". For instance, if you are using VMs for
     your overcloud nodes, you will need to include "--libvirt-type qemu".
     There are probably a couple of other parameters that you will need.

     You can watch the deployment using this command, which will show you
     the progress:

     watch "heat resource-list -n 5 | grep -v COMPLETE"

     You can also explore which resources have failed:

     heat resource-list [-n 5] | grep FAILED

     And then look more closely at the failed resources:

     heat resource-show overcloud <resource>

     There are some more complete troubleshooting instructions here:

http://docs.openstack.org/developer/tripleo-docs/troubleshooting/troubles...

     --
     Dan Sneddon         |  Principal OpenStack Engineer
     dsneddon(a)redhat.com |  redhat.com/openstack
     650.254.4025        |  dsneddon:irc   @dxs:twitter


 -- 
 *Tzach Shefi*
 Quality Engineer, Redhat OSP
 +972-54-4701080

The deployment looks like it is stuck to me. The problem, though,
appears to be an inability to set the power state on one of the VM
nodes through libvirt.
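
One way to confirm this from the undercloud is to look at the Ironic
node record directly instead of reading it out of nova-compute.log. This
is only a sketch: it assumes the ironic command-line client is installed
on the undercloud and that the undercloud credentials are already loaded
in the shell. node_power_summary is a hypothetical helper name; the UUID
is the hypervisor_hostname from the nova show output quoted above.

```shell
# Node UUID taken from OS-EXT-SRV-ATTR:hypervisor_hostname in the quoted output.
NODE_UUID="4626bf90-7f95-4bd7-8bee-5f5b0a0981c6"

# Hypothetical helper: print only the rows that matter for a stuck power-on.
# The pattern matches power_state, target_power_state, provision_state,
# target_provision_state and last_error in the node-show table.
node_power_summary() {
    ironic node-show "$1" | grep -E 'power_state|provision_state|last_error'
}

# Usage (on the undercloud, with credentials loaded):
#   node_power_summary "$NODE_UUID"
```

If last_error still shows the SSH failure and power_state stays at
"power off", the node never received its power-on command.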

For virtual environments, the SSH driver connects over SSH from the
Undercloud VM to the VM host system and issues libvirt commands to start
and stop VMs. That step failed when setting the power state of one of
your nodes, and the deployment doesn't appear to be recovering from that
error.
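
To see the underlying error, you could repeat that step by hand from the
Undercloud VM. The sketch below reuses the user, address, and domain name
from the driver_info and last_error fields in your log. It assumes the
user running it has an SSH key the virt host accepts (the driver itself
uses the key stored in driver_info, which may not be the same one).
virsh_start_cmd and try_power_on are hypothetical helper names.

```shell
# Values copied from the Ironic node record quoted in this thread.
VIRT_USER="root"               # driver_info.ssh_username
VIRT_HOST="192.168.122.1"      # driver_info.ssh_address
DOMAIN="baremetalbrbm_1"       # libvirt domain named in last_error

# Build the remote command as it appears in last_error.
virsh_start_cmd() {
    printf 'LC_ALL=C /usr/bin/virsh --connect qemu:///system start %s' "$1"
}

# Hypothetical helper: run the command over SSH without prompting for a
# password, giving up after 10 seconds if the host does not answer.
try_power_on() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 \
        "${VIRT_USER}@${VIRT_HOST}" "$(virsh_start_cmd "$DOMAIN")"
}

# Usage (on the Undercloud VM):
#   try_power_on
```

Either the domain starts, or you get the SSH or libvirt error message
that the driver only summarized.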

I'm not quite sure why that is happening, but I can think of a few
possible reasons:

* The SSH daemon is not running on the virt host
* The virt host was not able to respond to the request (perhaps it was
overloaded?)
* A firewall is blocking SSH connections from the Instack VM to the virt host
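
Each of these can be checked quickly. The following is a rough sketch,
not a definitive procedure: port_open and virt_host_checks are
hypothetical helper names, the address is the ssh_address from your log,
and the firewall check assumes the virt host uses either firewalld or
plain iptables.

```shell
VIRT_HOST="192.168.122.1"      # driver_info.ssh_address from the log

# Run on the Instack VM: can a TCP connection to the SSH port be opened?
# Uses bash's /dev/tcp pseudo-device; gives up after 5 seconds.
port_open() {
    timeout 5 bash -c "exec 3<>/dev/tcp/$1/${2:-22}" 2>/dev/null
}

# Run on the virt host itself: SSH daemon, load, and firewall state.
virt_host_checks() {
    systemctl is-active sshd            # prints "active" if sshd is running
    uptime                              # load averages hint at an overloaded host
    if firewall-cmd --state >/dev/null 2>&1; then
        firewall-cmd --list-services    # "ssh" should appear in the list
    else
        iptables -L INPUT -n            # look for rules rejecting port 22
    fi
}

# Usage:
#   port_open "$VIRT_HOST" && echo "SSH port reachable" || echo "SSH port blocked or closed"
```

A refused or timed-out connection from the Instack VM points at the first
or third item; a reachable port with a high load average points at the
second.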

One tip for the next deployment: you can set the timeout. That way, if
it does hang, you don't have to wait 4 hours for it to fail.
Conservatively, you could pass --timeout 90 to set the timeout to 90
minutes. A 2-node deployment will definitely either deploy or fail in
that amount of time (probably much less, but I wouldn't want you to cut
off a deployment that might be successful if given a little more time).
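
Put together with the --libvirt-type option from my earlier message, the
next attempt might look like the sketch below. deploy_overcloud is a
hypothetical wrapper name, and your environment may need further
parameters that I can't guess from here.

```shell
# Hypothetical wrapper: deploy with a timeout in minutes (default 90).
# --libvirt-type qemu applies because the overcloud nodes are VMs.
deploy_overcloud() {
    openstack overcloud deploy --templates \
        --libvirt-type qemu \
        --timeout "${1:-90}"
}

# Usage (as the stack user on the undercloud):
#   deploy_overcloud        # 90-minute timeout
#   deploy_overcloud 120    # allow two hours instead
```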

-- 
Dan Sneddon         |  Principal OpenStack Engineer
dsneddon(a)redhat.com |  redhat.com/openstack
650.254.4025        |  dsneddon:irc   @dxs:twitter
