[rdo-users] RHOSP 10 failed overcloud deployment

Pedro Sousa pgsousa at gmail.com
Fri Feb 2 14:36:29 UTC 2018


Hi,

You can check your errors with this command:

# openstack stack failures list overcloud
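
If you need more detail, drilling into the nested resources should also
work (a sketch, assuming the stock heat client on the OSP 10 undercloud):

# openstack stack resource list overcloud -n 5 | grep -i failed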

Like I said, if you want to use VXLANs you need to define a tenant network.
The internal network should also be used if you use static IPs and want to
bind the services to it, something like this in network-environment.yaml:

ServiceNetMap:
      NeutronTenantNetwork: internal_api
      CeilometerApiNetwork: internal_api
      AodhApiNetwork: internal_api
      MongoDbNetwork: internal_api
      CinderApiNetwork: internal_api
      CinderIscsiNetwork: internal_api
      GlanceApiNetwork: internal_api
      GlanceRegistryNetwork: internal_api
      KeystoneAdminApiNetwork: internal_api # allows undercloud to config endpoints
      KeystonePublicApiNetwork: internal_api
      NeutronApiNetwork: internal_api
      HeatApiNetwork: internal_api
      NovaApiNetwork: internal_api
      NovaMetadataNetwork: internal_api
      NovaVncProxyNetwork: internal_api
      SwiftMgmtNetwork: storage_mgmt
      SwiftProxyNetwork: storage
      SaharaApiNetwork: internal_api
      HorizonNetwork: internal_api
      MemcachedNetwork: internal_api
      RabbitMqNetwork: internal_api
      RedisNetwork: internal_api
      MysqlNetwork: internal_api
      CephClusterNetwork: storage_mgmt
      CephPublicNetwork: storage
      ControllerHostnameResolveNetwork: internal_api
      ComputeHostnameResolveNetwork: internal_api
      BlockStorageHostnameResolveNetwork: internal_api
      ObjectStorageHostnameResolveNetwork: internal_api
      CephStorageHostnameResolveNetwork: storage
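
ServiceNetMap typically sits under parameter_defaults in that file, and the
file is passed to the deploy command with -e. For reference, a sketch with
illustrative paths (adjust to your own layout):

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/templates/network-environment.yaml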

Also make sure you have a DNS entry in your NIC configuration:

config:
        os_net_config:
          network_config:
            -
              type: interface
              name: nic1
              use_dhcp: false
              dns_servers: {get_param: DnsServers}
              addresses:
                -
                  ip_netmask:
                    list_join:
                      - '/'
                      - - {get_param: ControlPlaneIp}
                        - {get_param: ControlPlaneSubnetCidr}
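
DnsServers itself is a parameter you would typically set in
network-environment.yaml; the resolvers below are placeholders, substitute
your own:

parameter_defaults:
  DnsServers: ["8.8.8.8", "8.8.4.4"]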




On Fri, Feb 2, 2018 at 1:33 PM, Anda Nicolae <anicolae at lenovo.com> wrote:

> Hi all,
>
>
>
> Thank you very much for your support. I think I am getting pretty close to
> finishing my deployment.
>
> My status now is:
>
> 'openstack stack resource list overcloud' displays all resources in
> CREATE_COMPLETE state, with the exception of AllNodesDeploySteps which is
> in CREATE_FAILED state.
>
> resource_status_reason is Error:
> resources.AllNodesDeploySteps.resources.ControllerDeployment_Step5.resources[0]:
> Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6
>
>
>
> From the JSON which has status_code 6 in /var/lib/heat-config/deployed on
> the Controller node, I have:
>
>
>
> deploy_stdout: Dependency Service[aodh-api] has failures: true
> Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::service::end]: Dependency Service[ceilometer-api] has failures: true
> Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::service::end]: Dependency Service[gnocchi-api] has failures: true
> Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::service::end]: Dependency Service[aodh-api] has failures: true
> Notice: /Stage[main]/Neutron::Keystone::Auth/Keystone::Resource::Service_identity[neutron]/Keystone_user[neutron]: Dependency Service[ceilometer-api] has failures: true
> Notice: /Stage[main]/Neutron::Keystone::Auth/Keystone::Resource::Service_identity[neutron]/Keystone_user[neutron]: Dependency Service[gnocchi-api] has failures: true
> Notice: /Stage[main]/Neutron::Keystone::Auth/Keystone::Resource::Service_identity[neutron]/Keystone_user[neutron]:
>
>
>
> deploy_stderr: Could not look up qualified variable '::nova::api::admin_user'; class ::nova::api has not been evaluated
> Warning: Scope(Class[Nova::Keystone::Authtoken]): Could not look up qualified variable '::nova::api::admin_password'; class ::nova::api has not been evaluated
> Warning: Scope(Class[Nova::Keystone::Authtoken]): Could not look up qualified variable '::nova::api::admin_tenant_name'; class ::nova::api has not been evaluated
> Warning: Scope(Class[Nova::Keystone::Authtoken]): Could not look up qualified variable '::nova::api::auth_uri'; class ::nova::api has not been evaluated
> Warning: Scope(Class[Nova::Keystone::Authtoken]): Could not look up qualified variable '::nova::api::auth_version'; class ::nova::api has not been evaluated
> Warning: Scope(Class[Nova::Keystone::Authtoken]): Could not look up qualified variable '::nova::api::identity_uri'; class ::nova::api has not been evaluated
>
>
>
> I've estimated that my deployment failed after half an hour, not after 4
> hours like it did before.
>
>
>
> I think my deployment failed because I haven't yet defined
> InternalApiNetCidr and TenantNetCidr.
>
> My next step will be to define these in network-environment.yaml.
>
> I will use static IP addresses for both InternalApiNetCidr and
> TenantNetCidr and I will add these static IP addresses in
> ips_from_pool_all.yaml.
>
> I will also define ../network/ports/internal_api_from_pool.yaml and
> ../network/ports/tenant_from_pool.yaml.
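>
> A minimal sketch of those parameter_defaults, with illustrative values
> only (the CIDRs and VLAN IDs must match the actual network fabric):
>
> parameter_defaults:
>   InternalApiNetCidr: 172.16.2.0/24
>   TenantNetCidr: 172.16.0.0/24
>   InternalApiNetworkVlanID: 20
>   TenantNetworkVlanID: 50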
>
> Please let me know whether you have other ideas why my deployment fails.
>
>
>
>
>
>
>
> To get here, I have added the following lines in both controller.yaml and
> compute.yaml:
>
> routes:
>   -
>     ip_netmask: 169.254.169.254/32
>     next_hop: {get_param: EC2MetadataIp}
>
> and the lines:
>
>   -
>     type: interface
>     name: eth1
>     use_dhcp: false # This effectively disables NIC1
>   -
>     type: interface
>     name: eth2
>     use_dhcp: false # This effectively disables NIC2
>
>
>
> On both the controller and the compute overcloud VMs, I have the following
> routing table:
>
> Kernel IP routing table
>
> Destination       Gateway                          Genmask           Flags  Metric  Ref  Use  Iface
> 0.0.0.0           <External Interface Gateway IP>  0.0.0.0           UG     0       0    0    br-ex
> <External CIDR>   0.0.0.0                          255.255.255.128   U      0       0    0    br-ex
> <Provision CIDR>  0.0.0.0                          255.255.255.0     U      0       0    0    eth3
> 169.254.169.254   <Undercloud Provision IP>        255.255.255.255   UGH    0       0    0    eth3
>
>
>
> Thanks,
>
> Anda
>
>
>
> *From:* Pedro Sousa [mailto:pgsousa at gmail.com]
> *Sent:* Friday, February 2, 2018 1:22 PM
>
> *To:* Anda Nicolae
> *Cc:* rasca at redhat.com; users at lists.rdoproject.org
> *Subject:* Re: [rdo-users] RHOSP 10 failed overcloud deployment
>
>
>
> Hi Anda,
>
>
>
> all the issues seem to be related. If you're using tunneled networks you
> need to configure tenant networks on both the controllers and the computes.
>
>
>
> Also, if you're using static IPs you should have the internal networks
> defined and bind the services to them in ServiceNetMap.
>
>
>
> On the compute nodes, if you don't use the external network, make sure you
> have the default route and 169.254.169.254/32 on the ctlplane network,
> something like this:
>
>
>
> network_config:
>   -
>     type: interface
>     name: nic1
>     use_dhcp: false
>     dns_servers: {get_param: DnsServers}
>     addresses:
>       -
>         ip_netmask:
>           list_join:
>             - '/'
>             - - {get_param: ControlPlaneIp}
>               - {get_param: ControlPlaneSubnetCidr}
>     routes:
>       -
>         ip_netmask: 169.254.169.254/32
>         next_hop: {get_param: EC2MetadataIp}
>       -
>         default: true
>         next_hop: {get_param: ControlPlaneDefaultRoute}
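>
> Once a node is up, the resulting table can be verified directly; a quick
> check, assuming standard iproute2 tooling:
>
> ip route | grep -E 'default|169.254.169.254'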
>
>
>
> Hope it helps.
>
>
>
>
>
>
>
>
>
> On Fri, Feb 2, 2018 at 9:04 AM, Anda Nicolae <anicolae at lenovo.com> wrote:
>
> Hi all,
>
>
>
> Thanks for the info about the 2 networks (external and ctlplane) that I
> need on the overcloud VMs (controller and compute).
>
> Now br-ex on my overcloud VMs has the external IP address and I am able to
> ping overcloud VMs on both external and ctlplane IP addresses.
>
>
>
> Also, since for the external network I use static IPs, in my
> ips-from-pool-all.yaml, I have:
>
> OS::TripleO::Compute::Ports::ExternalPort: ../network/ports/external_from_pool_compute.yaml
>
>
>
> external_from_pool_compute.yaml is similar to the external_from_pool.yaml
> file. I've noticed that if I use noop.yaml, the external IP is not assigned
> to the eth0 interface on the compute node.
>
> I hope it is correct to use it like this.
>
>
>
> I have continued with my overcloud deployment and I've noticed that some
> progress has been made:
>
> - Controller resource is now in CREATE_COMPLETE state
>
> - although deployment still fails, I can connect to the overcloud VMs via
> both ctlplane IP and external IP and check the logs, after the failure of
> the deploy operation
>
>
>
> Compute resource fails with the CREATE aborted reason. I've looked in
> /var/log/messages on the overcloud compute VM and I've noticed the
> following error messages that keep repeating:
>
> Feb  2 03:09:36 localhost os-collect-config: Source [ec2] Unavailable.
> Feb  2 03:09:36 localhost os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
> Feb  2 03:09:36 localhost os-collect-config: No local metadata found (['/var/lib/os-collect-config/local-data'])
> Feb  2 03:10:16 localhost os-collect-config: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/ (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x2752190>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))
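>
> A quick way to test the metadata path from the compute VM (a sketch,
> assuming the standard iproute2 and curl tools are present):
>
> ip route get 169.254.169.254
> curl --max-time 5 http://169.254.169.254/latest/meta-data/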
>
>
>
>
>
> From heat-engine.log, I have:
>
> 2018-02-01 19:26:32.253 3348 DEBUG neutronclient.v2_0.client [req-c27f050c-b743-4e1d-a706-e01e63a43b49 fdfcf2f659a94e57829dbefc618f3d3b 453c1e37b83f4f8e8a49dab299e8224d - - -] Error message: {"NeutronError": {"message": "Port 0292b718-2c28-4b0c-a517-c481c547b711 could not be found.", "type": "PortNotFound", "detail": ""}} _handle_fault_response /usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py:266
>
>
>
>
>
> I have 2 questions regarding the deployment:
>
> 1. Do any of the error messages above cause the failed deployment of the
> Compute resource?
>
> 2. In my network-environment.yaml, I haven't set InternalApiNetCidr,
> TenantNetCidr, InternalApiNetworkVlanID, TenantNetworkVlanID.
>
> Do I need to set these in order to make the overcloud deployment work?
>
>
>
> Thanks,
>
> Anda
>
>
>
>
>
> *From:* Anda Nicolae
> *Sent:* Wednesday, January 31, 2018 12:40 PM
> *To:* 'Pedro Sousa'
> *Cc:* rasca at redhat.com; users at lists.rdoproject.org
> *Subject:* RE: [rdo-users] RHOSP 10 failed overcloud deployment
>
>
>
> I've just run 'neutron net-list' on the undercloud node and I have the 2
> networks, ctlplane and external.
>
> My belief was that I don't need the external network, I only need the
> provision (ctlplane) network for the deployment.
>
> I don't have a DHCP server for my external network.
>
>
>
> Do I need to set the external IP address for the compute node and for the
> controller node in the yaml files from templates folder?
>
>
>
> Thanks,
>
> Anda
>
>
>
> *From:* Pedro Sousa [mailto:pgsousa at gmail.com <pgsousa at gmail.com>]
> *Sent:* Wednesday, January 31, 2018 12:32 PM
> *To:* Anda Nicolae
> *Cc:* rasca at redhat.com; users at lists.rdoproject.org
>
>
> *Subject:* Re: [rdo-users] RHOSP 10 failed overcloud deployment
>
>
>
> Hi Anda,
>
>
>
> some things you could check:
>
>
>
> Do you have 2 networks on director (ctlplane and external) and are they
> reachable from the overcloud nodes?
>
>
>
> Seems to me that you have network issues, and that's why you're seeing
> those long timeouts.
>
>
>
> For "Message: No valid host was found. There are not enough hosts
> available" message you could check "/var/log/nova/nova-conductor.log".
>
>
>
> Regards
>
>
>
>
>
> On Wed, Jan 31, 2018 at 10:14 AM, Anda Nicolae <anicolae at lenovo.com>
> wrote:
>
> I've let the deployment run overnight and it failed after almost 4hrs with
> the errors below. Do you happen to know the config file where I can
> decrease the timeout? I looked in /etc/nova/nova.conf and in ironic config
> files but I couldn't find anything relevant.
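>
> (One note on the timeout: assuming the stock tripleoclient, it is usually
> passed on the deploy command line in minutes rather than set in a config
> file, e.g. openstack overcloud deploy --templates --timeout 60 ...)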
>
> The errors are:
>
> [overcloud.Compute.0]: CREATE_FAILED  ResourceInError:
> resources[0].resources.NovaCompute: Went to status ERROR due to "Message:
> Unknown, Code: Unknown"
> [overcloud.Controller.0]: CREATE_FAILED  Resource CREATE failed:
> ResourceInError: resources.Controller: Went to status ERROR due to
> "Message: No valid host was found. There are not enough hosts available.,
> Code: 500"
>
> It is unclear to me why the above errors occur, since in my
> instackenv.json I declared node capabilities for both the compute and the
> controller node to be greater than the compute and controller flavors from
> 'openstack flavor list'.
>
> However, I've found this link and I am looking over it:
> https://docs.openstack.org/ironic/latest/admin/troubleshooting.html#nova-returns-no-valid-host-was-found-error
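>
> One way to cross-check the declared capabilities against the flavors (a
> sketch, assuming the stock OpenStack clients on the undercloud; substitute
> a real node UUID):
>
> openstack baremetal node show <node-uuid> -f value -c properties
> openstack flavor show compute -c ram -c vcpus -c disk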
>
> Thanks,
> Anda
>
> -----Original Message-----
> From: Raoul Scarazzini [mailto:rasca at redhat.com]
> Sent: Tuesday, January 30, 2018 8:17 PM
> To: Anda Nicolae; users at lists.rdoproject.org
> Subject: Re: [rdo-users] RHOSP 10 failed overcloud deployment
>
> On 01/30/2018 04:39 PM, Anda Nicolae wrote:
> > Got it.
> >
> > I've noticed that it spends quite some time in the CREATE_IN_PROGRESS
> > state for the OS::Heat::ResourceGroup resource (on the Controller node).
> > Overcloud deployment fails after 4h. I will check which config file the
> > overcloud deployment timeout is configured in and decrease it.
> >
> > Thanks,
> > Anda
>
> Check also network settings. 4h timeout is the default when something is
> unreachable.
>
> --
> Raoul Scarazzini
> rasca at redhat.com
> _______________________________________________
> users mailing list
> users at lists.rdoproject.org
> http://lists.rdoproject.org/mailman/listinfo/users
>
> To unsubscribe: users-unsubscribe at lists.rdoproject.org
>
>
>
>
>