Hi,
I have seen the same issues when deploying on HP blades. I had chosen to
deploy on a subset of blades to save time whilst testing, and the error
was caused by a rogue blade.
Previous attempts on a different set of blades in the same chassis had
left one or more blades powered on, which presented duplicate IP
addresses in the blade cluster and interfered with my new deployment.
Basically, check that all of your nodes are in the correct state, i.e.
look in the iLO, cross-reference with Ironic and check the power state.
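For example, something along these lines on the undercloud (compare the
Power State column against what each blade's iLO reports):

  ironic node-list
  ironic node-show <node-uuid> | grep power_state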
HTH
Charles
On 14/10/2015 12:40, Udi Kalifon wrote:
My overcloud deployment also hangs for 4 hours and then fails. This is
what I got on the 1st run:
[stack@instack ~]$ openstack overcloud deploy --templates
Deploying templates in the directory
/usr/share/openstack-tripleo-heat-templates
ERROR: Authentication failed. Please try again with option
--include-password or export HEAT_INCLUDE_PASSWORD=1
Authentication required
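Presumably the workaround for that particular error is just what the
message says, i.e. something like this before re-running (I haven't
verified it changes the outcome):

  export HEAT_INCLUDE_PASSWORD=1
  openstack overcloud deploy --templates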
I am assuming the authentication error is due to the expiration of the
token after 4 hours, and not because I forgot the rc file. I tried to
run the deployment again and it failed after another 4 hours with a
different error:
[stack@instack ~]$ openstack overcloud deploy --templates
Deploying templates in the directory
/usr/share/openstack-tripleo-heat-templates
Stack failed with status: resources.Controller: resources[0]:
ResourceInError: resources.Controller: Went to status ERROR due to
"Message: Exceeded maximum number of retries. Exceeded max scheduling
attempts 3 for instance 9eedda9e-f381-47d4-a883-0fe40db0eb5e. Last
exception: [u'Traceback (most recent call last):\n', u' File
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1,
Code: 500"
Heat Stack update failed.
The failed resources are:
heat resource-list -n 5 overcloud |egrep -v COMPLETE
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+--------------------------------------------------+
| resource_name | physical_resource_id                 | resource_type           | resource_status | updated_time        | stack_name                                       |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+--------------------------------------------------+
| Compute       | aee2604f-2580-44c9-bc38-45046970fd63 | OS::Heat::ResourceGroup | UPDATE_FAILED   | 2015-10-14T06:32:34 | overcloud                                        |
| 0             | 2199c1c6-60ca-42a4-927c-8bf0fb8763b7 | OS::TripleO::Compute    | UPDATE_FAILED   | 2015-10-14T06:32:36 | overcloud-Compute-dq426vplp2nu                   |
| Controller    | 2ae19a5f-f88c-4d8b-98ec-952657b70cd6 | OS::Heat::ResourceGroup | UPDATE_FAILED   | 2015-10-14T06:32:36 | overcloud                                        |
| 0             | 2fc3ed0c-da5c-45e4-a255-4b4a8ef58dd7 | OS::TripleO::Controller | UPDATE_FAILED   | 2015-10-14T06:32:38 | overcloud-Controller-ktbqsolaqm4u                |
| NovaCompute   | 7938bbe0-ab97-499f-8859-15f903e7c09b | OS::Nova::Server        | CREATE_FAILED   | 2015-10-14T06:32:55 | overcloud-Compute-dq426vplp2nu-0-4acm6pstctor    |
| Controller    | c1cd6b72-ec0d-4c13-b21c-10d0f6c45788 | OS::Nova::Server        | CREATE_FAILED   | 2015-10-14T06:32:58 | overcloud-Controller-ktbqsolaqm4u-0-d76rtersrtyt |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+--------------------------------------------------+
I was unable to run resource-show or deployment-show on the failed
resources; it kept complaining that those resources were not found.
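Maybe I need to address the nested stacks directly, using the stack_name
column from the table above, e.g. (guessing at the syntax):

  heat resource-show overcloud-Compute-dq426vplp2nu-0-4acm6pstctor NovaCompute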
Thanks,
Udi.
On Wed, Oct 14, 2015 at 11:16 AM, Tzach Shefi <tshefi@redhat.com> wrote:
Hi Sasha/Dan,
Yep, that's the bug I opened yesterday about this.
sshd and firewall rules look OK, having tested the following:
I can ssh into the virt host from my laptop as the root user, covering
the 10.X.X.X net.
I can also ssh from the instack VM to the virt host, covering the
192.168.122.X net.
Unless I should check ssh with another user; if so, which one?
I doubt an ssh user/firewall issue caused the problem, as the controller
was installed successfully and it uses the same ssh virt power-on
procedure.
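For reference, the checks I ran were roughly the following (the virt
host address on the 10.X.X.X net is redacted; 192.168.122.1 is the
ssh_address ironic uses in my setup):

  ssh root@<virt-host>       # from my laptop, over the 10.X.X.X net
  ssh root@192.168.122.1     # from the instack VM, over the 192.168.122.X net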
The deployment is still up and stuck; if anyone wants to take a look,
contact me in private for access details.
I will review/use the virt console, virt journal and timeout tips on the
next deployment.
Thanks
Tzach
On Wed, Oct 14, 2015 at 5:07 AM, Sasha Chuzhoy <sasha@redhat.com> wrote:
I hit the same (or a similar) issue on my BM environment, though I
managed to complete the 1+1 deployment on VMs successfully.
I see it's reported already:
https://bugzilla.redhat.com/show_bug.cgi?id=1271289
Ran a deployment with: openstack overcloud deploy --templates --timeout 90 --compute-scale 3 --control-scale 1
The deployment fails, and I see that "all minus one" overcloud
nodes are still in BUILD status.
[stack@undercloud ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks            |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| b15f499e-79ed-46b2-b990-878dbe6310b1 | overcloud-controller-0  | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.23 |
| 4877d14a-e34e-406b-8005-dad3d79f5bab | overcloud-novacompute-0 | ACTIVE | -          | Running     | ctlplane=192.0.2.9  |
| 0fd1a7ed-367e-448e-8602-8564bf087e92 | overcloud-novacompute-1 | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.21 |
| 51630a7d-c140-47b9-a071-1f2fdb45f4b4 | overcloud-novacompute-2 | BUILD  | spawning   | NOSTATE     | ctlplane=192.0.2.22 |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
Will try to investigate further tomorrow.
Best regards,
Sasha Chuzhoy.
----- Original Message -----
> From: "Tzach Shefi" <tshefi@redhat.com>
> To: "Dan Sneddon" <dsneddon@redhat.com>
> Cc: rdo-list@redhat.com
> Sent: Tuesday, October 13, 2015 6:01:48 AM
> Subject: Re: [Rdo-list] Overcloud deploy stuck for a long time
>
> So I gave it a few more hours; in the heat resource list nothing has
> failed, only create_complete and some init_complete.
>
> Nova list:
> | 61aaed37-4993-4165-93a7-3c9bf6b10a21 | overcloud-controller-0  | ACTIVE | -        | Running | ctlplane=192.0.2.8 |
> | 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7 | overcloud-novacompute-0 | BUILD  | spawning | NOSTATE | ctlplane=192.0.2.9 |
>
>
> nova show 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7
> +--------------------------------------+----------------------------------------------------------+
> | Property | Value |
> +--------------------------------------+----------------------------------------------------------+
> | OS-DCF:diskConfig | MANUAL |
> | OS-EXT-AZ:availability_zone | nova |
> | OS-EXT-SRV-ATTR:host | instack.localdomain |
> | OS-EXT-SRV-ATTR:hypervisor_hostname | 4626bf90-7f95-4bd7-8bee-5f5b0a0981c6 |
> | OS-EXT-SRV-ATTR:instance_name | instance-00000002 |
> | OS-EXT-STS:power_state | 0 |
> | OS-EXT-STS:task_state | spawning |
> | OS-EXT-STS:vm_state | building |
>
> Checking the nova log, this is what I see:
>
> nova-compute.log:{"nodes": [{"target_power_state": null,
> "links": [{"href": "http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6", "rel": "self"},
> {"href": "http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6", "rel": "bookmark"}],
> "extra": {},
> "last_error": " Failed to change power state to 'power on'. Error: Failed to execute command via SSH : LC_ALL=C /usr/bin/virsh --connect qemu:///system start baremetalbrbm_1.",
> "updated_at": "2015-10-12T14:36:08+00:00", "maintenance_reason": null,
> "provision_state": "deploying", "clean_step": {},
> "uuid": "4626bf90-7f95-4bd7-8bee-5f5b0a0981c6", "console_enabled": false,
> "target_provision_state": "active",
> "provision_updated_at": "2015-10-12T14:35:18+00:00",
> "power_state": "power off",
> "inspection_started_at": null, "inspection_finished_at": null,
> "maintenance": false, "driver": "pxe_ssh", "reservation": null,
> "properties": {"memory_mb": "4096", "cpu_arch": "x86_64", "local_gb": "40", "cpus": "1", "capabilities": "boot_option:local"},
> "instance_uuid": "7f9f4f52-3ee6-42d9-9275-ff88582dd6e7", "name": null,
> "driver_info": {"ssh_username": "root",
> "deploy_kernel": "94cc528d-d91f-4ca7-876e-2d8cbec66f1b",
> "deploy_ramdisk": "057d3b42-002a-4c24-bb3f-2032b8086108",
> "ssh_key_contents": "-----BEGIN( I removed key..)END RSA PRIVATE KEY-----",
> "ssh_virt_type": "virsh", "ssh_address": "192.168.122.1"},
> "created_at": "2015-10-12T14:26:30+00:00",
> "ports": [{"href": "http://192.0.2.1:6385/v1/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports", "rel": "self"},
> {"href": "http://192.0.2.1:6385/nodes/4626bf90-7f95-4bd7-8bee-5f5b0a0981c6/ports", "rel": "bookmark"}],
> "driver_internal_info": {"clean_steps": null, "root_uuid_or_disk_id": "9ff90423-9d18-4dd1-ae96-a4466b52d9d9", "is_whole_disk_image": false},
> "instance_info": {"ramdisk": "82639516-289d-4603-bf0e-8131fa75ec46",
> "kernel": "665ffcb0-2afe-4e04-8910-45b92826e328", "root_gb": "40",
> "display_name": "overcloud-novacompute-0",
> "image_source": "d99f460e-c6d9-4803-99e4-51347413f348",
> "capabilities": "{\"boot_option\": \"local\"}", "memory_mb": "4096",
> "vcpus": "1", "deploy_key": "BI0FRWDTD4VGHII9JK2BYDDFR8WB1WUG",
> "local_gb": "40", "configdrive":
"H4sICGDEG1YC/3RtcHpwcWlpZQDt3WuT29iZ2HH02Bl7Fe/G5UxSqS3vLtyesaSl2CR4p1zyhk2Ct+ateScdVxcIgiR4A5sAr95xxa/iVOUz7EfJx8m7rXyE5IDslro1mpbGox15Zv6/lrpJ4AAHN/LBwXMIShIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADhJpvx+5UQq5EqNtvzldGs+MIfewJeNv53f/7n354F6xT/3v/TjH0v/chz0L5+8Gv2f3V+n0s+Pz34u/dj982PJfvSTvxFVfXQ7vfyBlRfGvOZo+kQuWWtNVgJn/jO/d6kHzvrGWlHOjGn0TDfmjmXL30kZtZSrlXPFREaVxQM5Hon4fdl0TU7nCmqtU6urRTlZVRP1clV+knwqK/F4UFbPOuVGKZNKFNTbgVFvwO+PyPmzipqo1solX/6slszmCuKozBzKuKPdMlE5ma
>
>
> Any ideas on how to resolve a stuck spawning compute node? It hasn't
> changed for a few hours now.
>
> Tzach
>
>
> On Mon, Oct 12, 2015 at 11:25 PM, Dan Sneddon <dsneddon@redhat.com> wrote:
>
>
>
> On 10/12/2015 08:10 AM, Tzach Shefi wrote:
> > Hi,
> >
> > Server running CentOS 7.1; the VM running the undercloud got up to
> > the overcloud deploy stage.
> > It looks like it's stuck; nothing has advanced for a while.
> > Ideas on what to check?
> >
> > [stack@instack ~]$ openstack overcloud deploy --templates
> > Deploying templates in the directory
> > /usr/share/openstack-tripleo-heat-templates
> > [91665.696658] device vnet2 entered promiscuous mode
> > [91665.781346] device vnet3 entered promiscuous mode
> > [91675.260324] kvm [71183]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> > [91675.291232] kvm [71200]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> > [91767.799404] kvm: zapping shadow pages for mmio generation wraparound
> > [91767.880480] kvm: zapping shadow pages for mmio generation wraparound
> > [91768.957761] device vnet2 left promiscuous mode
> > [91769.799446] device vnet3 left promiscuous mode
> > [91771.223273] device vnet3 entered promiscuous mode
> > [91771.232996] device vnet2 entered promiscuous mode
> > [91773.733967] kvm [72245]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> > [91801.270510] device vnet2 left promiscuous mode
> >
> >
> > Thanks
> > Tzach
> >
> >
>
> You're going to need a more complete command line than "openstack
> overcloud deploy --templates". For instance, if you are using VMs for
> your overcloud nodes, you will need to include "--libvirt-type qemu".
> There are probably a couple of other parameters that you will need.
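>
> For example, something along these lines for a virtual setup (the exact
> options will depend on your environment):
>
> openstack overcloud deploy --templates --libvirt-type qemu --timeout 90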
>
> You can watch the deployment using this command, which will show you
> the progress:
>
> watch "heat resource-list -n 5 | grep -v COMPLETE"
>
> You can also explore which resources have failed:
>
> heat resource-list [-n 5]| grep FAILED
>
> And then look more closely at the failed resources:
>
> heat resource-show overcloud <resource>
>
> There are some more complete troubleshooting instructions here:
>
> http://docs.openstack.org/developer/tripleo-docs/troubleshooting/troubles...
>
> --
> Dan Sneddon | Principal OpenStack Engineer
> dsneddon@redhat.com | redhat.com/openstack
> 650.254.4025 | dsneddon:irc @dxs:twitter
>
>
>
>
> --
> Tzach Shefi
> Quality Engineer, Redhat OSP
> +972-54-4701080
>
--
Tzach Shefi
Quality Engineer, Redhat OSP
+972-54-4701080
_______________________________________________
Rdo-list mailing list
Rdo-list@redhat.com
https://www.redhat.com/mailman/listinfo/rdo-list
To unsubscribe: rdo-list-unsubscribe@redhat.com
--
Charles Short
Cloud Engineer
Virtualization and Cloud Team
European Bioinformatics Institute (EMBL-EBI)
Tel: +44 (0)1223 494205