Hi everyone,
My cluster is deployed with both Controller HA and Instance HA. The deployment
completed without errors, but I noticed something strange in the 'pcs status'
output on the controllers:
 Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
     Started: [ overcloud-novacompute-0 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-novacompute-1 ]
 nova-evacuate (ocf::openstack:NovaEvacuate): Started overcloud-controller-0
 stonith-fence_ipmilan-002590a2d2c7 (stonith:fence_ipmilan): Started overcloud-controller-1
 stonith-fence_ipmilan-002590a1c641 (stonith:fence_ipmilan): Started overcloud-controller-2
 stonith-fence_ipmilan-002590f25822 (stonith:fence_ipmilan): Started overcloud-controller-0
 stonith-fence_ipmilan-002590f3977a (stonith:fence_ipmilan): Started overcloud-controller-2
 stonith-fence_ipmilan-002590f2631a (stonith:fence_ipmilan): Started overcloud-controller-1
Notice that the stonith-fence_ipmilan lines show unexpected hosts for the last
two devices: those MAC addresses belong to overcloud-novacompute-0 and
overcloud-novacompute-1, yet the corresponding stonith resources were started
on controller nodes. Is this right?
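If it matters, I assume the host each fencing device actually targets is
recorded in its pcmk_host_list/pcmk_host_map attributes, which I think can be
checked with something like this (device name taken from the output above;
the pcs syntax may differ between versions):

  pcs stonith show stonith-fence_ipmilan-002590f3977a --full

My understanding is that the node a stonith resource is started on only shows
where the fencing agent runs, not which node it is able to fence, but I would
like to confirm that this layout is expected.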
There are also some failed actions in the status output:
Failed Actions:
* overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 03:48:55 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error' (1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 14:50:25 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 03:47:51 2018', queued=0ms, exec=0ms
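Since the failed starts are for the overcloud-novacompute-1 resource itself, I
assume this is the pacemaker remote connection resource for that node failing
to come up (I believe that connection goes over pacemaker_remote on TCP 3121,
in case connectivity is relevant). If it helps, I think it can be inspected
with something like the following, though I am not certain of the exact pcs
syntax on every version:

  pcs resource show overcloud-novacompute-1
  pcs resource failcount show overcloud-novacompute-1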
I can spin up VMs, but failover does not work. If I manually trigger a crash
on one of the compute nodes, the affected VMs remain in ERROR state and the
affected compute node is unable to rejoin the cluster afterward. Even after a
manual reboot, the affected compute node cannot start the pcs cluster service,
and its 'nova_compute' container remains unhealthy, with the latest 'docker
logs' output being:
++ cat /run_command
+ CMD='/var/lib/nova/instanceha/check-run-nova-compute '
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/nova ]]
+++ stat -c %a /var/log/kolla/nova
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/nova
++ . /usr/local/bin/kolla_nova_extend_start
+++ [[ ! -d /var/lib/nova/instances ]]
+ echo 'Running command:
'\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
...
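My understanding is that the "fence-down flag" that check-run-nova-compute
keeps waiting on is a Pacemaker node attribute that is supposed to be cleared
once the node has been fenced and its instances evacuated. If that is right, I
would expect to be able to query it from a controller with something like the
following (the attribute name 'evacuate' is only my guess, so please correct
me if it is something else):

  attrd_updater --name evacuate --query --node overcloud-novacompute-1

but I am not sure whether that is the right attribute, or what is supposed to
clear it.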
So I guess something may be wrong with fencing, but I have no idea what caused
it or how to fix it. Any help/suggestions/opinions would be greatly
appreciated. Thank you very much.
Regards,
Cody