A follow-up note on this:
I passed docker-ha.yaml and compute-instanceha.yaml along with fencing.yaml
to the CLI at the time of deployment. I hope this is the correct way to
achieve Controller HA and Instance HA in a single deployment. But
evidently, something is wrong here with the compute fencing and unfence
resources. Any help would be greatly appreciated.
Thank you,
Cody
On Wed, Oct 3, 2018 at 11:46 AM Cody <codeology.lab(a)gmail.com> wrote:
Hi everyone,
My cluster is deployed with both Controller and Instance HA. The
deployment completed without errors, but I noticed something strange from
the 'pcs status' output from the controllers:
Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
    Started: [ overcloud-novacompute-0 ]
    Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-novacompute-1 ]
nova-evacuate (ocf::openstack:NovaEvacuate): Started overcloud-controller-0
stonith-fence_ipmilan-002590a2d2c7 (stonith:fence_ipmilan): Started overcloud-controller-1
stonith-fence_ipmilan-002590a1c641 (stonith:fence_ipmilan): Started overcloud-controller-2
stonith-fence_ipmilan-002590f25822 (stonith:fence_ipmilan): Started overcloud-controller-0
stonith-fence_ipmilan-002590f3977a (stonith:fence_ipmilan): Started overcloud-controller-2
stonith-fence_ipmilan-002590f2631a (stonith:fence_ipmilan): Started overcloud-controller-1
Notice that the last two stonith-fence_ipmilan lines show what look like
the wrong hosts. Those MAC addresses belong to overcloud-novacompute-0 and
overcloud-novacompute-1, but the devices were started on controller nodes.
Is this right?
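In case it helps anyone compare placements, here is a rough Python sketch
that pulls out which node each stonith device is reported as started on.
The regex and the function name stonith_placement are my own invention,
not part of pcs, and the sample text is simply copied from the output
above. It only matches devices in the 'Started' state.

```python
import re

# Sample copied from the 'pcs status' output above. In practice the text
# could come from running `pcs status` on a controller (an assumption).
PCS_STATUS = """\
stonith-fence_ipmilan-002590a2d2c7 (stonith:fence_ipmilan): Started
overcloud-controller-1
stonith-fence_ipmilan-002590a1c641 (stonith:fence_ipmilan): Started
overcloud-controller-2
stonith-fence_ipmilan-002590f25822 (stonith:fence_ipmilan): Started
overcloud-controller-0
stonith-fence_ipmilan-002590f3977a (stonith:fence_ipmilan): Started
overcloud-controller-2
stonith-fence_ipmilan-002590f2631a (stonith:fence_ipmilan): Started
overcloud-controller-1
"""

# \s+ also matches newlines, so this works whether the node name is on the
# same line as the device or wrapped onto the next one.
STONITH_RE = re.compile(
    r"(stonith-\S+)\s+\(stonith:[^)]+\):\s+Started\s+(\S+)"
)


def stonith_placement(status_text):
    """Return {device_name: node_name} for every started stonith device."""
    return dict(STONITH_RE.findall(status_text))


placement = stonith_placement(PCS_STATUS)
for device, node in sorted(placement.items()):
    print(f"{device} -> {node}")
```

This only reports where each device is running. It says nothing about
which host a device is meant to fence.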
There are also some failed actions from the status output:
Failed Actions:
* overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 03:48:55 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error' (1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 14:50:25 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 03:47:51 2018', queued=0ms, exec=0ms
I can spin up VMs, but failover does not work. If I manually trigger a
crash on one of the compute nodes, the affected VMs remain in the ERROR
state and the affected compute node is unable to rejoin the cluster
afterward.
After a manual reboot of the affected compute node, it cannot start the
pcs cluster service. Its 'nova_compute' container also remains unhealthy
after the reboot, and the latest 'docker logs' output is:
++ cat /run_command
+ CMD='/var/lib/nova/instanceha/check-run-nova-compute '
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/nova ]]
+++ stat -c %a /var/log/kolla/nova
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/nova
++ . /usr/local/bin/kolla_nova_extend_start
+++ [[ ! -d /var/lib/nova/instances ]]
+ echo 'Running command: '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
...
So I guess something is wrong with fencing, but I have no idea what
caused it or how to fix it. Any help, suggestions, or opinions would be
greatly appreciated. Thank you very much.
Regards.
Cody