A follow-up note on this:

I passed docker-ha.yaml and compute-instanceha.yaml, along with fencing.yaml, to the CLI at deployment time. I hope this is the correct way to achieve Controller HA and Instance HA in a single deployment, but evidently something is wrong with the compute fencing and unfence resources. Any help would be greatly appreciated.
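
For reference, the environment files were passed roughly like this (the template paths shown are the standard locations; the fencing.yaml path and the remaining arguments are placeholders):

  openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/compute-instanceha.yaml \
    -e ~/fencing.yaml \
    ...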

Thank you,
Cody

On Wed, Oct 3, 2018 at 11:46 AM Cody <codeology.lab@gmail.com> wrote:
Hi everyone,

My cluster is deployed with both Controller HA and Instance HA. The deployment completed without errors, but I noticed something strange in the 'pcs status' output on the controllers:

 Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
     Started: [ overcloud-novacompute-0 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-novacompute-1 ]
 nova-evacuate    (ocf::openstack:NovaEvacuate):    Started overcloud-controller-0
 stonith-fence_ipmilan-002590a2d2c7    (stonith:fence_ipmilan):    Started overcloud-controller-1
 stonith-fence_ipmilan-002590a1c641    (stonith:fence_ipmilan):    Started overcloud-controller-2
 stonith-fence_ipmilan-002590f25822    (stonith:fence_ipmilan):    Started overcloud-controller-0
 stonith-fence_ipmilan-002590f3977a    (stonith:fence_ipmilan):    Started overcloud-controller-2
 stonith-fence_ipmilan-002590f2631a    (stonith:fence_ipmilan):    Started overcloud-controller-1

Notice that the last two stonith-fence_ipmilan entries appear to show the wrong hosts: their MAC addresses belong to overcloud-novacompute-0 and overcloud-novacompute-1, yet the resources are started on controller nodes. Is this right?

There are also some failed actions in the status output:

Failed Actions:
* overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 03:48:55 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error' (1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 14:50:25 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 03:47:51 2018', queued=0ms, exec=0ms

I can spin up VMs, but failover does not work. If I manually trigger a crash on one of the compute nodes, the affected VMs remain in ERROR state and the crashed compute node is unable to rejoin the cluster afterward.
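
(A hard crash can be forced on a node, for example, with:

  echo c > /proc/sysrq-trigger

which triggers an immediate kernel panic.)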

After a manual reboot, the affected compute node cannot start the pcs cluster service. Its 'nova_compute' container also remains unhealthy after the reboot, with the latest 'docker logs' output ending in:

++ cat /run_command
+ CMD='/var/lib/nova/instanceha/check-run-nova-compute '
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/nova ]]
+++ stat -c %a /var/log/kolla/nova
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/nova
++ . /usr/local/bin/kolla_nova_extend_start
+++ [[ ! -d /var/lib/nova/instances ]]
+ echo 'Running command: '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
...
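
If I read the log correctly, the "fence-down flag" is a Pacemaker node attribute that the check-run-nova-compute script polls. Assuming the attribute is named 'evacuate' (my guess from the log message), it should be possible to inspect it from a controller with something like:

  attrd_updater --query --name evacuate --node overcloud-novacompute-1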

So I guess something may be wrong with fencing, but I have no idea what caused it or how to fix it. Any help/suggestions/opinions would be greatly appreciated. Thank you very much.


Regards,
Cody