[Rdo-list] Overcloud deploy stuck for a long time
Ignacio Bravo
ibravo at ltgfederal.com
Thu Feb 18 16:14:50 UTC 2016
I have deployed successfully on HP blades and found this issue to be related to the number of interfaces that each blade presents through Ironic. In other words, Ironic may try to provision over a particular NIC that is different from the NIC the blade is actually booting from.
This was discussed here on the list earlier. The objective is to run Ironic introspection, then check that each node has only one NIC registered (the exact command name escapes me right now, but it was something in the ironic CLI), verify that it is connected to the VLAN you want, and delete the ports that are not correct.
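
From memory it went roughly like this (command names are from the old ironic CLI and the UUIDs are placeholders, so treat this as a sketch):

ironic node-list
ironic node-port-list <node-uuid>   # should end up showing exactly one port per node
ironic port-delete <port-uuid>      # remove any port that is not on the provisioning VLAN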
Once that is clear, you should be able to run the deployment command.
IB
__
Ignacio Bravo
LTG Federal, Inc
www.ltgfederal.com
> On Feb 18, 2016, at 11:02 AM, Charles Short <cems at ebi.ac.uk> wrote:
>
> Hi,
>
> I have seen the same issues when deploying on HP Blades. I had chosen to deploy on a subset of blades to save time whilst testing. The error was caused by a rogue blade.
> Previous attempts on a different set of blades in the same chassis had left one or more blades powered on, presenting duplicate IP addresses in the blade cluster and interfering with my new deployment.
> Basically, check that all of your nodes are in the correct state, i.e. look in the iLO, cross-reference with Ironic, and check the power state.
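>
> As a sketch (the ipmitool invocation is from memory, and the iLO address and credentials are placeholders):
>
> ironic node-list   # power state as Ironic sees it
> ipmitool -I lanplus -H <ilo-address> -U <user> -P <password> power status   # as the chassis sees it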
>
>
> HTH
>
> Charles
>
> On 14/10/2015 12:40, Udi Kalifon wrote:
>> My overcloud deployment also hangs for 4 hours and then fails. This is what I got on the 1st run:
>>
>> [stack at instack ~]$ openstack overcloud deploy --templates
>> Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
>> ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1
>> Authentication required
>>
>> I am assuming the authentication error is due to the expiration of the token after 4 hours, and not because I forgot the rc file. I tried to run the deployment again and it failed after another 4 hours with a different error:
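>>
>> If it really is token expiry, the workaround the error message itself suggests would be to set the variable before re-running:
>>
>> export HEAT_INCLUDE_PASSWORD=1
>> openstack overcloud deploy --templates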
>>
>> [stack at instack ~]$ openstack overcloud deploy --templates
>> Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
>> Stack failed with status: resources.Controller: resources[0]: ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 9eedda9e-f381-47d4-a883-0fe40db0eb5e. Last exception: [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1, Code: 500"
>> Heat Stack update failed.
>>
>> The failed resources are:
>>
>> heat resource-list -n 5 overcloud |egrep -v COMPLETE
>> +-------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------+
>> | resource_name | physical_resource_id | resource_type | resource_status | updated_time | stack_name |
>> +-------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------+
>> | Compute | aee2604f-2580-44c9-bc38-45046970fd63 | OS::Heat::ResourceGroup | UPDATE_FAILED | 2015-10-14T06:32:34 | overcloud |
>> | 0 | 2199c1c6-60ca-42a4-927c-8bf0fb8763b7 | OS::TripleO::Compute | UPDATE_FAILED | 2015-10-14T06:32:36 | overcloud-Compute-dq426vplp2nu |
>> | Controller | 2ae19a5f-f88c-4d8b-98ec-952657b70cd6 | OS::Heat::ResourceGroup | UPDATE_FAILED | 2015-10-14T06:32:36 | overcloud |
>> | 0 | 2fc3ed0c-da5c-45e4-a255-4b4a8ef58dd7 | OS::TripleO::Controller | UPDATE_FAILED | 2015-10-14T06:32:38 | overcloud-Controller-ktbqsolaqm4u |
>> | NovaCompute | 7938bbe0-ab97-499f-8859-15f903e7c09b | OS::Nova::Server | CREATE_FAILED | 2015-10-14T06:32:55 | overcloud-Compute-dq426vplp2nu-0-4acm6pstctor |
>> | Controller | c1cd6b72-ec0d-4c13-b21c-10d0f6c45788 | OS::Nova::Server | CREATE_FAILED | 2015-10-14T06:32:58 | overcloud-Controller-ktbqsolaqm4u-0-d76rtersrtyt |
>> +-------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------+
>>
>>
>> I was unable to run resource-show or deployment-show on the failed resources; it kept complaining that the resources were not found.
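>> Presumably the nested resources have to be addressed through their nested stack rather than through "overcloud" itself, e.g. something like the following, but I have not confirmed it:
>>
>> heat resource-show overcloud-Compute-dq426vplp2nu-0-4acm6pstctor NovaCompute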
>>
>> Thanks,
>> Udi.
>>
>>
>> On Wed, Oct 14, 2015 at 11:16 AM, Tzach Shefi <tshefi at redhat.com> wrote:
>> Hi Sasha/Dan,
>> Yep, that's my bug; I opened it yesterday about this.
>>
>> sshd and the firewall rules look OK; I tested the following:
>> I can ssh into the virt host from my laptop as root, which checks the 10.X.X.X net.
>> I can also ssh from the instack VM to the virt host, which checks the 192.168.122.X net.
>>
>> Unless I should check ssh with another user; if so, which one?
>> I doubt the ssh user or firewall caused the problem, as the controller was installed successfully and it uses the same ssh virt power-on method.
>>
>> The deployment is still up and stuck; if anyone wants to take a look, contact me in private for access details.
>>
>> I will review/use the virt console, virt journal and timeout tips on the next deployment.
>>
>> Thanks
>> Tzach
>>
>>
>> On Wed, Oct 14, 2015 at 5:07 AM, Sasha Chuzhoy <sasha at redhat.com> wrote:
>> I hit the same (or a similar) issue on my bare-metal (BM) environment, though I managed to complete the 1+1 deployment on VMs successfully.
>> I see it's reported already: https://bugzilla.redhat.com/show_bug.cgi?id=1271289
>>
>> Ran a deployment with: openstack overcloud deploy --templates --timeout 90 --compute-scale 3 --control-scale 1
>> The deployment fails, and I see that all but one of the overcloud nodes are still in BUILD status.
>>
>> [stack at undercloud ~]$ nova list
>> +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
>> | ID | Name | Status | Task State | Power State | Networks |
>> +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
>> | b15f499e-79ed-46b2-b990-878dbe6310b1 | overcloud-controller-0 | BUILD | spawning | NOSTATE | ctlplane=192.0.2.23 |
>> | 4877d14a-e34e-406b-8005-dad3d79f5bab | overcloud-novacompute-0 | ACTIVE | - | Running | ctlplane=192.0.2.9 |
>> | 0fd1a7ed-367e-448e-8602-8564bf087e92 | overcloud-novacompute-1 | BUILD | spawning | NOSTATE | ctlplane=192.0.2.21 |
>> | 51630a7d-c140-47b9-a071-1f2fdb45f4b4 | overcloud-novacompute-2 | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.22 |
>> +--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
>>
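>> A cross-check against Ironic might show where the BUILD nodes are stuck; as a sketch, from the undercloud:
>>
>> source stackrc
>> ironic node-list   # compare the Power State / Provisioning State columns with the nova list above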
>>
>> Will try to investigate further tomorrow.
>>
>> Best regards,
>> Sasha Chuzhoy.
>>
>> ----- Original Message -----
>> > From: "Tzach Shefi" <tshefi at redhat.com>
>> > To: "Dan Sneddon" <dsneddon at redhat.com>
>> > Cc: rdo-list at redhat.com
>> > Sent: Tuesday, October 13, 2015 6:01:48 AM
>> > Subject: Re: [Rdo-list] Overcloud deploy stuck for a long time
>> >
>> > So I gave it a few more hours; in heat resource-list nothing is failed, only
>> > CREATE_COMPLETE and some INIT_COMPLETE.
>> >
>> > nova list:
>> > | 61aaed37-4993-4165-93a7-3c9bf6b10a21 | overcloud-controller-0  | ACTIVE | -        | Running | ctlplane=192.0.2.8 |
>> > | 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7 | overcloud-novacompute-0 | BUILD  | spawning | NOSTATE | ctlplane=192.0.2.9 |
>> >
>> >
>> > nova show 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7
>> > +--------------------------------------+----------------------------------------------------------+
>> > | Property | Value |
>> > +--------------------------------------+----------------------------------------------------------+
>> > | OS-DCF:diskConfig | MANUAL |
>> > | OS-EXT-AZ:availability_zone | nova |
>> > | OS-EXT-SRV-ATTR:host | instack.localdomain |
>> > | OS-EXT-SRV-ATTR:hypervisor_hostname | 4626bf90-7f95-4bd7-8bee-5f5b0a0981c6 |
>> > | OS-EXT-SRV-ATTR:instance_name | instance-00000002 |
>> > | OS-EXT-STS:power_state | 0 |
>> > | OS-EXT-STS:task_state | spawning |
>> > | OS-EXT-STS:vm_state | building |
>> >
>> > Checking nova log this is what I see:
>> >
>> > nova-compute.log:{"nodes": [{"target_power_state": null, "links": [{"href":
>> > "http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6",
>> > "rel": "self"}, {"href":
>> > "http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6", "rel":
>> > "bookmark"}], "extra": {}, "last_error": " Failed to change power state to
>> > 'power on'. Error: Failed to execute command via SSH : LC_ALL=C
>> > /usr/bin/virsh --connect qemu:///system start baremetalbrbm_1.",
>> > "updated_at": "2015-10-12T14:36:08+00:00", "maintenance_reason": null,
>> > "provision_state": "deploying", "clean_step": {}, "uuid":
>> > "4626bf90-7f95-4bd7-8bee-5f5b0a0981c6", "console_enabled": false,
>> > "target_provision_state": "active", "provision_updated_at":
>> > "2015-10-12T14:35:18+00:00", "power_state": "power off",
>> > "inspection_started_at": null, "inspection_finished_at": null,
>> > "maintenance": false, "driver": "pxe_ssh", "reservation": null,
>> > "properties": {"memory_mb": "4096", "cpu_arch": "x86_64", "local_gb": "40",
>> > "cpus": "1", "capabilities": "boot_option:local"}, "instance_uuid":
>> > "7f9f4f52-3ee6-42d9-9275-ff88582dd6e7", "name": null, "driver_info":
>> > {"ssh_username": "root", "deploy_kernel":
>> > "94cc528d-d91f-4ca7-876e-2d8cbec66f1b", "deploy_ramdisk":
>> > "057d3b42-002a-4c24-bb3f-2032b8086108", "ssh_key_contents": "-----BEGIN( I
>> > removed key..)END RSA PRIVATE KEY-----", "ssh_virt_type": "virsh",
>> > "ssh_address": "192.168.122.1"}, "created_at": "2015-10-12T14:26:30+00:00",
>> > "ports": [{"href": "
>> > http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports <http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports> ",
>> > "rel": "self"}, {"href": "
>> > http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports <http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports> ",
>> > "rel": "bookmark"}], "driver_internal_info": {"clean_steps": null,
>> > "root_uuid_or_disk_id": "9ff90423-9d18-4dd1-ae96-a4466b52d9d9",
>> > "is_whole_disk_image": false}, "instance_info": {"ramdisk":
>> > "82639516-289d-4603-bf0e-8131fa75ec46", "kernel":
>> > "665ffcb0-2afe-4e04-8910-45b92826e328", "root_gb": "40", "display_name":
>> > "overcloud-novacompute-0", "image_source":
>> > "d99f460e-c6d9-4803-99e4-51347413f348", "capabilities": "{\"boot_option\":
>> > \"local\"}", "memory_mb": "4096", "vcpus": "1", "deploy_key":
>> > "BI0FRWDTD4VGHII9JK2BYDDFR8WB1WUG", "local_gb": "40", "configdrive":
>> > "H4sICGDEG1YC/3RtcHpwcWlpZQDt3WuT29iZ2HH02Bl7Fe/G5UxSqS3vLtyesaSl2CR4p1zyhk2Ct+ateScdVxcIgiR4A5sAr95xxa/iVOUz7EfJx8m7rXyE5IDslro1mpbGox15Zv6/lrpJ4AAHN/LBwXMIShIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADhJpvx+5UQq5EqNtvzldGs+MIfewJeNv53f/7n354F6xT/3v/TjH0v/chz0L5+8Gv2f3V+n0s+Pz34u/dj982PJfvSTvxFVfXQ7vfyBlRfGvOZo+kQuWWtNVgJn/jO/d6kHzvrGWlHOjGn0TDfmjmXL30kZtZSrlXPFREaVxQM5Hon4fdl0TU7nCmqtU6urRTlZVRP1clV+knwqK/F4UFbPOuVGKZNKFNTbgVFvwO+PyPmzipqo1solX/6slszmCuKozBzKuKPdMlE5ma
>> >
>> >
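>> > The key error above seems to be the failed virsh power-on over SSH. A
>> > manual check from the undercloud might be (assuming the same user and key
>> > that Ironic uses, per the driver_info above):
>> >
>> > ssh -l root 192.168.122.1 "LC_ALL=C virsh --connect qemu:///system start baremetalbrbm_1"
>> >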
>> > Any ideas on how to resolve a stuck spawning compute node? It hasn't
>> > changed for a few hours now.
>> >
>> > Tzach
>> >
>> >
>> > On Mon, Oct 12, 2015 at 11:25 PM, Dan Sneddon <dsneddon at redhat.com> wrote:
>> >
>> >
>> >
>> > On 10/12/2015 08:10 AM, Tzach Shefi wrote:
>> > > Hi,
>> > >
>> > > Server running CentOS 7.1; the VM running the undercloud got up to the
>> > > overcloud deploy stage.
>> > > It looks like it's stuck; nothing has advanced for a while.
>> > > Any ideas what to check?
>> > >
>> > > [stack at instack ~]$ openstack overcloud deploy --templates
>> > > Deploying templates in the directory
>> > > /usr/share/openstack-tripleo-heat-templates
>> > > [91665.696658] device vnet2 entered promiscuous mode
>> > > [91665.781346] device vnet3 entered promiscuous mode
>> > > [91675.260324] kvm [71183]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>> > > [91675.291232] kvm [71200]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>> > > [91767.799404] kvm: zapping shadow pages for mmio generation wraparound
>> > > [91767.880480] kvm: zapping shadow pages for mmio generation wraparound
>> > > [91768.957761] device vnet2 left promiscuous mode
>> > > [91769.799446] device vnet3 left promiscuous mode
>> > > [91771.223273] device vnet3 entered promiscuous mode
>> > > [91771.232996] device vnet2 entered promiscuous mode
>> > > [91773.733967] kvm [72245]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>> > > [91801.270510] device vnet2 left promiscuous mode
>> > >
>> > >
>> > > Thanks
>> > > Tzach
>> > >
>> > >
>> >
>> > You're going to need a more complete command line than "openstack
>> > overcloud deploy --templates". For instance, if you are using VMs for
>> > your overcloud nodes, you will need to include "--libvirt-type qemu".
>> > There are probably a couple of other parameters that you will need.
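>> >
>> > As a sketch only (the scale and timeout values here are examples, not
>> > recommendations):
>> >
>> > openstack overcloud deploy --templates --libvirt-type qemu \
>> >     --control-scale 1 --compute-scale 1 --timeout 90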
>> >
>> > You can watch the deployment using this command, which will show you
>> > the progress:
>> >
>> > watch "heat resource-list -n 5 overcloud | grep -v COMPLETE"
>> >
>> > You can also explore which resources have failed:
>> >
>> > heat resource-list [-n 5] overcloud | grep FAILED
>> >
>> > And then look more closely at the failed resources:
>> >
>> > heat resource-show overcloud <resource>
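>> >
>> > For a failed OS::Nova::Server resource, the physical_resource_id is the
>> > Nova instance UUID, so checking the instance fault directly can also help
>> > (a sketch):
>> >
>> > nova show <physical-resource-id>   # look at the "fault" field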
>> >
>> > There are some more complete troubleshooting instructions here:
>> >
>> > http://docs.openstack.org/developer/tripleo-docs/troubleshooting/troubleshooting-overcloud.html
>> >
>> > --
>> > Dan Sneddon | Principal OpenStack Engineer
>> > dsneddon at redhat.com | redhat.com/openstack
>> > 650.254.4025 | dsneddon:irc @dxs:twitter
>> >
>> >
>> >
>> >
>> > --
>> > Tzach Shefi
>> > Quality Engineer, Redhat OSP
>> > +972-54-4701080
>> >
>>
>>
>>
>> --
>> Tzach Shefi
>> Quality Engineer, Redhat OSP
>> +972-54-4701080
>>
>>
>>
> --
> Charles Short
> Cloud Engineer
> Virtualization and Cloud Team
> European Bioinformatics Institute (EMBL-EBI)
> Tel: +44 (0)1223 494205
> _______________________________________________
> Rdo-list mailing list
> Rdo-list at redhat.com
> https://www.redhat.com/mailman/listinfo/rdo-list
>
> To unsubscribe: rdo-list-unsubscribe at redhat.com