[rdo-users] Problems with Controller and Instance HA
Cody
codeology.lab at gmail.com
Wed Oct 3 15:46:52 UTC 2018
Hi everyone,
My cluster is deployed with both Controller HA and Instance HA. The deployment
completed without errors, but I noticed something strange in the 'pcs
status' output on the controllers:
Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
    Started: [ overcloud-novacompute-0 ]
    Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-novacompute-1 ]
nova-evacuate (ocf::openstack:NovaEvacuate): Started overcloud-controller-0
stonith-fence_ipmilan-002590a2d2c7 (stonith:fence_ipmilan): Started overcloud-controller-1
stonith-fence_ipmilan-002590a1c641 (stonith:fence_ipmilan): Started overcloud-controller-2
stonith-fence_ipmilan-002590f25822 (stonith:fence_ipmilan): Started overcloud-controller-0
stonith-fence_ipmilan-002590f3977a (stonith:fence_ipmilan): Started overcloud-controller-2
stonith-fence_ipmilan-002590f2631a (stonith:fence_ipmilan): Started overcloud-controller-1
Notice that the stonith-fence_ipmilan lines show what look like incorrect
hosts for the last two devices. Their MAC addresses belong to
overcloud-novacompute-0 and overcloud-novacompute-1, yet the resources are
started on controller nodes. Is this right?
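In case it matters, here is roughly how I have been checking which node each
fence device is actually allowed to fence, versus where the resource merely
runs (just a sketch; the device name is the last one from my output above):

  # show the device options, including pcmk_host_list / pcmk_host_map,
  # which define the node this device is allowed to fence
  pcs stonith show stonith-fence_ipmilan-002590f3977a

  # show any constraints that control where the stonith resource may run
  pcs constraint --full | grep -A 2 stonith-fence_ipmilan-002590f3977a

My understanding is that the 'Started' host is only where the stonith
resource is being monitored from, not the node it fences, but I would
appreciate a confirmation on that.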
There are also some failed actions in the status output:
Failed Actions:
* overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error' (1):
    call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 03:48:55 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error' (1):
    call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 14:50:25 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error' (1):
    call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 03:47:51 2018', queued=0ms, exec=0ms
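If it is useful, this is what I was planning to run to clear those failures
and retry (I believe overcloud-novacompute-1 in the failed actions above is
the pacemaker_remote connection resource for that compute node):

  # on a controller: clear the failed start records so Pacemaker
  # retries the remote connection to the compute node
  pcs resource cleanup overcloud-novacompute-1

  # on the compute node: confirm pacemaker_remote is running and
  # listening on TCP 3121 so the controllers can reach it
  systemctl status pacemaker_remote
  ss -tlnp | grep 3121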
I can spin up VMs, but failover does not work. If I manually trigger a crash
on one of the compute nodes, the affected VMs remain in ERROR state and the
affected compute node is unable to rejoin the cluster afterward. After a
manual reboot of the affected compute node, it cannot start the pcs cluster
service. Its 'nova_compute' container also remains unhealthy after the
reboot, with the latest 'docker logs' messages being:
++ cat /run_command
+ CMD='/var/lib/nova/instanceha/check-run-nova-compute '
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/nova ]]
+++ stat -c %a /var/log/kolla/nova
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/nova
++ . /usr/local/bin/kolla_nova_extend_start
+++ [[ ! -d /var/lib/nova/instances ]]
+ echo 'Running command:
'\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
...
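If I understand the Instance HA pieces correctly, that fence-down flag is a
per-node attribute which NovaEvacuate is supposed to clear once evacuation of
the failed node has finished, and check-run-nova-compute simply waits for it.
I have been poking at it from a controller roughly like this (the attribute
name 'evacuate' is my assumption from reading the wrapper script, so please
correct me if that is wrong):

  # query the per-node attribute the wrapper script appears to wait on
  # ('evacuate' as the attribute name is an assumption on my part)
  attrd_updater --query --name evacuate --node overcloud-novacompute-1

  # cross-check: dump the transient node attributes from the CIB
  cibadmin --query --scope status | grep -i evacuate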
So I guess something may be wrong with fencing, but I have no idea what
caused it or how to fix it. Any help/suggestions/opinions would be
greatly appreciated. Thank you very much.
Regards.
Cody