Hello all,

I tried to deploy RDO Pike without containers on our internal platform.

The setup is pretty simple:
 - 3 Controllers in HA
 - 5 Ceph nodes
 - 4 Compute nodes
 - 3 Object-Store nodes

I didn't use any exotic parameters.
This is my deployment command:

openstack overcloud deploy --templates \
  -e environment.yaml \
  --ntp-server 0.pool.ntp.org \
  -e storage-env.yaml \
  -e network-env.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-ceph.yaml \
  --control-scale 3 --control-flavor control \
  --compute-scale 4 --compute-flavor compute \
  --ceph-storage-scale 5 --ceph-storage-flavor ceph-storage \
  --swift-storage-flavor swift-storage --swift-storage-scale 3 \
  -e scheduler_hints_env.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml

environment.yaml :
parameter_defaults:
  ControllerCount: 3
  ComputeCount: 4
  CephStorageCount: 5
  OvercloudCephStorageFlavor: ceph-storage
  CephDefaultPoolSize: 3
  ObjectStorageCount: 3

network-env.yaml :
resource_registry:
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/templates/nic-configs/compute.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/templates/nic-configs/controller.yaml
  OS::TripleO::CephStorage::Net::SoftwareConfig: /home/stack/templates/nic-configs/ceph-storage.yaml
  OS::TripleO::ObjectStorage::Net::SoftwareConfig: /home/stack/templates/nic-configs/swift-storage.yaml

parameter_defaults:
  InternalApiNetCidr: 172.16.0.0/24
  TenantNetCidr: 172.17.0.0/24
  StorageNetCidr: 172.18.0.0/24
  StorageMgmtNetCidr: 172.19.0.0/24
  ManagementNetCidr: 172.20.0.0/24
  ExternalNetCidr: 10.41.11.0/24
  InternalApiAllocationPools: [{'start': '172.16.0.10', 'end': '172.16.0.200'}]
  TenantAllocationPools: [{'start': '172.17.0.10', 'end': '172.17.0.200'}]
  StorageAllocationPools: [{'start': '172.18.0.10', 'end': '172.18.0.200'}]
  StorageMgmtAllocationPools: [{'start': '172.19.0.10', 'end': '172.19.0.200'}]
  ManagementAllocationPools: [{'start': '172.20.0.10', 'end': '172.20.0.200'}]
  # Leave room for floating IPs in the External allocation pool
  ExternalAllocationPools: [{'start': '10.41.11.10', 'end': '10.41.11.30'}]
  # Set to the router gateway on the external network
  ExternalInterfaceDefaultRoute: 10.41.11.254
  # Gateway router for the provisioning network (or Undercloud IP)
  ControlPlaneDefaultRoute: 192.168.131.253
  # The IP address of the EC2 metadata server. Generally the IP of the Undercloud
  EC2MetadataIp: 192.0.2.1
  # Define the DNS servers (maximum 2) for the overcloud nodes
  DnsServers: ["10.38.5.26"]
  InternalApiNetworkVlanID: 202
  StorageNetworkVlanID: 203
  StorageMgmtNetworkVlanID: 204
  TenantNetworkVlanID: 205
  ManagementNetworkVlanID: 206
  ExternalNetworkVlanID: 198
  NeutronExternalNetworkBridge: "''"
  ControlPlaneSubnetCidr: '24'
  BondInterfaceOvsOptions: "mode=balance-xor"

storage-env.yaml :
parameter_defaults:
  ExtraConfig:
    ceph::profile::params::osds:
        '/dev/sdb': {}
        '/dev/sdc': {}
        '/dev/sdd': {}
        '/dev/sde': {}
        '/dev/sdf': {}
        '/dev/sdg': {}
  SwiftRingBuild: false
  RingBuild: false


scheduler_hints_env.yaml :
parameter_defaults:
    ControllerSchedulerHints:
        'capabilities:node': 'control-%index%'
    NovaComputeSchedulerHints:
        'capabilities:node': 'compute-%index%'
    CephStorageSchedulerHints:
        'capabilities:node': 'ceph-storage-%index%'
    ObjectStorageSchedulerHints:
        'capabilities:node': 'swift-storage-%index%'

After a little use, I found that one controller is unable to get an active HA router, and I got this output:

neutron l3-agent-list-hosting-router XXX
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 420a7e31-bae1-4f8c-9438-97839cf190c4 | overcloud-controller-0.localdomain | True           | :-)   | standby  |
| 6a943aa5-6fd1-4b44-8557-f0043b266a2f | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| dd66ef16-7533-434f-bf5b-25e38c51375f | overcloud-controller-2.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

So each time a router is scheduled on this controller, I can't get an active router. I tried to compare the configurations, but everything seems fine. I redeployed to see if it would help, and the only thing that changed was the controller where the HA routers get stuck.
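In case anyone wants to reproduce the check quickly: here is a small sketch of my own (these helpers are not part of any OpenStack client) that parses the ASCII table printed by `neutron l3-agent-list-hosting-router` and reports whether any agent holds the router in the "active" ha_state, so the all-standby situation above can be spotted in a loop across routers.

```python
# Hypothetical helpers to detect the "all standby" condition from the
# table output of `neutron l3-agent-list-hosting-router <router>`.
# Column names match the output shown above; the parsing logic is mine.

def parse_agent_table(output):
    """Return one dict per data row of the CLI's ASCII table."""
    # Data and header rows start with '|'; the '+---+' separators do not.
    lines = [l for l in output.strip().splitlines() if l.startswith('|')]
    header = [c.strip() for c in lines[0].strip('|').split('|')]
    rows = []
    for line in lines[1:]:
        cells = [c.strip() for c in line.strip('|').split('|')]
        rows.append(dict(zip(header, cells)))
    return rows

def has_active_instance(rows):
    """True if at least one l3-agent reports the router as active."""
    return any(r['ha_state'] == 'active' for r in rows)

sample = """\
+----+------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+----+------+----------------+-------+----------+
| a  | c-0  | True           | :-)   | standby  |
| b  | c-1  | True           | :-)   | standby  |
| c  | c-2  | True           | :-)   | standby  |
+----+------+----------------+-------+----------+
"""

rows = parse_agent_table(sample)
print(has_active_instance(rows))  # all agents standby -> prints False
```

Feeding it the real output of the command for each router makes it easy to list exactly which routers are stuck without an active instance.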

The only message that I got is from the OVS agent:

2017-10-20 08:38:44.930 136145 WARNING neutron.agent.rpc [req-0ad9aec4-f718-498f-9ca7-15b265340174 - - - - -] Device Port(admin_state_up=True,allowed_address_pairs=[],binding=PortBinding,binding_levels=[],created_at=2017-10-20T08:38:38Z,data_plane_status=<?>,description='',device_id='a7e23552-9329-4572-a69d-d7f316fcc5c9',device_owner='network:router_ha_interface',dhcp_options=[],distributed_binding=None,dns=None,fixed_ips=[IPAllocation],id=7b6d81ef-0451-4216-9fe5-52d921052cb7,mac_address=fa:16:3e:13:e9:3c,name='HA port tenant 0ee0af8e94044a42923873939978ed42',network_id=ffe5ffa5-2693-4d35-988e-7290899601e0,project_id='',qos_policy_id=None,revision_number=5,security=PortSecurity(7b6d81ef-0451-4216-9fe5-52d921052cb7),security_group_ids=set([]),status='DOWN',updated_at=2017-10-20T08:38:44Z) is not bound.
2017-10-20 08:38:44.944 136145 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-0ad9aec4-f718-498f-9ca7-15b265340174 - - - - -] Device 7b6d81ef-0451-4216-9fe5-52d921052cb7 not defined on plugin or binding failed

Any ideas?

--

LECOMTE Cedric

Senior Software Engineer

Red Hat

clecomte@redhat.com