Hi everyone,

My cluster is deployed with both Controller HA and Instance HA. The deployment completed without errors, but I noticed something odd in the 'pcs status' output on the controllers:

 Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
     Started: [ overcloud-novacompute-0 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-novacompute-1 ]
 nova-evacuate    (ocf::openstack:NovaEvacuate):    Started overcloud-controller-0
 stonith-fence_ipmilan-002590a2d2c7    (stonith:fence_ipmilan):    Started overcloud-controller-1
 stonith-fence_ipmilan-002590a1c641    (stonith:fence_ipmilan):    Started overcloud-controller-2
 stonith-fence_ipmilan-002590f25822    (stonith:fence_ipmilan):    Started overcloud-controller-0
 stonith-fence_ipmilan-002590f3977a    (stonith:fence_ipmilan):    Started overcloud-controller-2
 stonith-fence_ipmilan-002590f2631a    (stonith:fence_ipmilan):    Started overcloud-controller-1

Notice that the stonith-fence_ipmilan lines show what appear to be incorrect hosts for the last two devices: their MAC addresses belong to overcloud-novacompute-0 and overcloud-novacompute-1, yet the devices are started on controller nodes. Is this right?
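
For reference, each device's configured host mapping can be dumped with something like the following (run on a controller; device names taken from the output above, and I believe 'pcs stonith show <id>' prints the pcmk_host_* attributes on this pcs version):

  pcs stonith show stonith-fence_ipmilan-002590f25822
  pcs stonith show stonith-fence_ipmilan-002590f2631a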

There are also some failed actions in the status output:

Failed Actions:
* overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 03:48:55 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error' (1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 14:50:25 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 03:47:51 2018', queued=0ms, exec=0ms

I can spin up VMs, but failover does not work. If I manually trigger a crash on one of the compute nodes (sketched below), the affected VMs remain in ERROR state and the affected compute node is unable to rejoin the cluster afterward.
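
The crash is triggered roughly like this (a forced kernel panic via sysrq; assumes sysrq is enabled):

  # run on the compute node; this panics the kernel immediately
  echo 1 > /proc/sys/kernel/sysrq
  echo c > /proc/sysrq-trigger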

After a manual reboot of the affected compute node, it cannot start the cluster services via pcs. Its 'nova_compute' container also remains unhealthy after the reboot, with the latest 'docker logs' messages being:

++ cat /run_command
+ CMD='/var/lib/nova/instanceha/check-run-nova-compute '
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/nova ]]
+++ stat -c %a /var/log/kolla/nova
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/nova
++ . /usr/local/bin/kolla_nova_extend_start
+++ [[ ! -d /var/lib/nova/instances ]]
+ echo 'Running command: '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
...
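
If I read check-run-nova-compute correctly, it loops until a per-node Pacemaker attribute is cleared. Assuming the flag is the 'evacuate' node attribute (my guess from the Instance HA scripts; I have not confirmed the name), it should be queryable from a controller with:

  attrd_updater --query --name evacuate --node overcloud-novacompute-1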

So I guess something may be wrong with fencing, but I have no idea what caused it or how to fix it. Any help, suggestions, or opinions would be greatly appreciated. Thank you very much.


Regards,
Cody