A follow-up note on this:

I passed docker-ha.yaml and compute-instanceha.yaml, along with fencing.yaml, to the CLI at deployment time. I hope this is the correct way to achieve Controller HA and Instance HA in a single deployment, but evidently something is wrong with the compute fencing and unfence resources. Any help would be greatly appreciated.

Thank you,
Cody
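For reference, this is roughly how I invoked the deployment (a sketch from memory; the exact paths to my copies of the environment files are assumed here, not verbatim):

```shell
# Sketch of the deploy command; the template paths below are the stock
# tripleo-heat-templates locations and my fencing.yaml path is assumed.
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/compute-instanceha.yaml \
  -e ~/templates/fencing.yaml
```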
On Wed, Oct 3, 2018 at 11:46 AM Cody <codeology.lab(a)gmail.com> wrote:
 Hi everyone,
 My cluster is deployed with both Controller and Instance HA. The
 deployment completed without errors, but I noticed something strange from
 the 'pcs status' output from the controllers:
 Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
     Started: [ overcloud-novacompute-0 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-novacompute-1 ]
 nova-evacuate    (ocf::openstack:NovaEvacuate):    Started overcloud-controller-0
 stonith-fence_ipmilan-002590a2d2c7    (stonith:fence_ipmilan):    Started overcloud-controller-1
 stonith-fence_ipmilan-002590a1c641    (stonith:fence_ipmilan):    Started overcloud-controller-2
 stonith-fence_ipmilan-002590f25822    (stonith:fence_ipmilan):    Started overcloud-controller-0
 stonith-fence_ipmilan-002590f3977a    (stonith:fence_ipmilan):    Started overcloud-controller-2
 stonith-fence_ipmilan-002590f2631a    (stonith:fence_ipmilan):    Started overcloud-controller-1
Notice that the stonith-fence_ipmilan lines show unexpected hosts for the last two devices. Those MAC addresses belong to overcloud-novacompute-0 and overcloud-novacompute-1, yet the devices are started on controller nodes. Is this right?
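In case it helps, this is how I have been inspecting the devices. The commands below are a sketch (device IDs copied from the status output above); my understanding is that the node a stonith device is "Started" on is where its monitor runs, which may differ from the hosts it is allowed to fence, so I have been looking at the pcmk_host_list/pcmk_host_map attributes instead:

```shell
# Show the full configuration of one of the suspect devices, including
# pcmk_host_list / pcmk_host_map (device ID taken from 'pcs status').
pcs stonith show stonith-fence_ipmilan-002590f3977a --full

# Show all location constraints, in case the devices are being pinned
# to the wrong nodes.
pcs constraint --full
```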
 There are also some failed actions from the status output:
 Failed Actions:
 * overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 03:48:55 2018', queued=0ms, exec=0ms
 * overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error' (1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 14:50:25 2018', queued=0ms, exec=0ms
 * overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 03:47:51 2018', queued=0ms, exec=0ms
 I can spin up VMs, but failover does not work. If I manually trigger a crash on one of the compute nodes, the affected VMs remain in ERROR state and the affected compute node is unable to rejoin the cluster afterward.

 After a manual reboot, the affected compute node cannot start the pcs cluster service. Its 'nova_compute' container also remains unhealthy after the reboot, with the latest 'docker logs' message being:
 ++ cat /run_command
 + CMD='/var/lib/nova/instanceha/check-run-nova-compute '
 + ARGS=
 + [[ ! -n '' ]]
 + . kolla_extend_start
 ++ [[ ! -d /var/log/kolla/nova ]]
 +++ stat -c %a /var/log/kolla/nova
 ++ [[ 2755 != \7\5\5 ]]
 ++ chmod 755 /var/log/kolla/nova
 ++ . /usr/local/bin/kolla_nova_extend_start
 +++ [[ ! -d /var/lib/nova/instances ]]
 + echo 'Running command:
 '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
 Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
 + exec /var/lib/nova/instanceha/check-run-nova-compute
 Waiting for fence-down flag to be cleared
 Waiting for fence-down flag to be cleared
 Waiting for fence-down flag to be cleared
 Waiting for fence-down flag to be cleared
 Waiting for fence-down flag to be cleared
 Waiting for fence-down flag to be cleared
 ...
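From what I can tell, check-run-nova-compute is waiting on a per-node attribute before letting nova-compute start. I queried it from a controller like this (a guess on my part; the attribute name 'evacuate' is what I believe the instance-HA agents use, but I may be wrong):

```shell
# Query the per-node attribute that the "fence-down flag" appears to
# track (attribute name assumed); run from one of the controllers.
attrd_updater --query -n evacuate -N overcloud-novacompute-1
```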
 So I guess something may be wrong with fencing, but I have no idea what caused it or how to fix it. Any help/suggestions/opinions would be greatly appreciated. Thank you very much.

 Regards,
 Cody