Hi Cody,
On Wed, Oct 03, 2018 at 11:46:52AM -0400, Cody wrote:
Hi everyone,
My cluster is deployed with both Controller and Instance HA. The deployment
completed without errors, but I noticed something strange in the 'pcs
status' output on the controllers:
Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
Started: [ overcloud-novacompute-0 ]
Stopped: [ overcloud-controller-0 overcloud-controller-1
overcloud-controller-2 overcloud-novacompute-1 ]
nova-evacuate (ocf::openstack:NovaEvacuate): Started
overcloud-controller-0
stonith-fence_ipmilan-002590a2d2c7 (stonith:fence_ipmilan): Started
overcloud-controller-1
stonith-fence_ipmilan-002590a1c641 (stonith:fence_ipmilan): Started
overcloud-controller-2
stonith-fence_ipmilan-002590f25822 (stonith:fence_ipmilan): Started
overcloud-controller-0
stonith-fence_ipmilan-002590f3977a (stonith:fence_ipmilan): Started
overcloud-controller-2
stonith-fence_ipmilan-002590f2631a (stonith:fence_ipmilan): Started
overcloud-controller-1
Notice that the stonith-fence_ipmilan lines show unexpected hosts for the
last two devices. Those MAC addresses belong to overcloud-novacompute-0 and
overcloud-novacompute-1, yet the devices are started on the controller
nodes. Is this right?
That is correct. Stonith resources only run on full cluster nodes
(controllers) and not on pacemaker remote nodes (computes).
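You can double-check this from any controller; a rough sketch (assuming the
pcs 0.9.x CLI shipped with that release, adjust to your version):
"""
# list the fence devices and the node each one is currently running on
pcs stonith show
pcs status resources | grep fence_ipmilan

# the computes appear as remote nodes, not full corosync members,
# which is why they never host the stonith resources themselves
pcs status nodes
"""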
There are also some failed actions from the status output:
Failed Actions:
* overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error'
(1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3
03:48:55 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error'
(1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3
14:50:25 2018', queued=0ms, exec=0ms
* overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error'
(1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3
03:47:51 2018', queued=0ms, exec=0ms
Are these from the fresh deployment, or did they appear after you triggered
a crash on a compute node?
I can spin up VMs, but failover does not work. If I manually trigger a
crash on one of the compute nodes, the affected VMs remain in ERROR state
and the crashed compute node is unable to rejoin the cluster afterward.
After a manual reboot of the affected compute node, it cannot start the pcs
cluster service. Its 'nova_compute' container also remains unhealthy after
the reboot, with the latest 'docker logs' messages being:
++ cat /run_command
+ CMD='/var/lib/nova/instanceha/check-run-nova-compute '
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/nova ]]
+++ stat -c %a /var/log/kolla/nova
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/nova
++ . /usr/local/bin/kolla_nova_extend_start
+++ [[ ! -d /var/lib/nova/instances ]]
+ echo 'Running command:
'\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
Waiting for fence-down flag to be cleared
...
So I guess something may be wrong with fencing, but I have no idea what
caused it or how to fix it. Any help/suggestions/opinions would be greatly
appreciated. Thank you very much.
So when a compute node crashes, one of the things that happens is that its
nova-compute service gets forcefully marked as down. You should be able to
unblock it manually via:
'nova service-list' to find the uuid of the forced-down service and then
'nova service-force-down --unset <uuid>'
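I.e. roughly something like this (assuming the usual overcloudrc on the
undercloud; the id to pass comes from the first command, and the exact
service-force-down syntax depends on your novaclient/microversion):
"""
source ~/overcloudrc
nova service-list --binary nova-compute
# the crashed compute's nova-compute service shows State=down / Forced down=True
nova service-force-down --unset <uuid-of-that-service>
"""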
I am not sure I understand the exact picture of the problem. Is it:
A) After I crash a compute node the VMs do not get resurrected on
another compute node?
B) The compute node I just crashed hangs at boot with the nova container
waiting for the fence-down flag to be cleared?
Is only B) the issue or also A)?
For B) can you try the following?
"""
pcs resource update overcloud-novacompute-0 meta reconnect_interval=180s
pcs resource update overcloud-novacompute-1 meta reconnect_interval=180s
pcs resource cleanup --all
"""
and retry the process and report back?
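For the retry, something along these lines should be enough to reproduce
(the sysrq trigger is just one way to simulate a hard crash; heat-admin and
the hostnames below are assumptions based on your output):
"""
# hard-crash one compute node to simulate a failure
ssh heat-admin@overcloud-novacompute-1 'echo c | sudo tee /proc/sysrq-trigger'

# then watch fencing/unfencing and the evacuation from a controller
watch -n 5 'pcs status | grep -A 3 compute-unfence-trigger'
nova service-list --binary nova-compute   # should flip back to 'up' once recovery completes
"""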
cheers,
Michele
--
Michele Baldessari <michele(a)acksyn.org>
C2A5 9DA3 9961 4FFB E01B D0BC DDD4 DCCB 7515 5C6D