[rdo-users] Problems with Controller and Instance HA

Michele Baldessari michele at acksyn.org
Thu Oct 4 06:59:22 UTC 2018


Hi Cody,
On Wed, Oct 03, 2018 at 11:46:52AM -0400, Cody wrote:
> Hi everyone,
> 
> My cluster is deployed with both Controller and Instance HA. The deployment
> completed without errors, but I noticed something strange from the 'pcs
> status' output from the controllers:
> 
>  Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
>      Started: [ overcloud-novacompute-0 ]
>      Stopped: [ overcloud-controller-0 overcloud-controller-1
> overcloud-controller-2 overcloud-novacompute-1 ]
>  nova-evacuate    (ocf::openstack:NovaEvacuate):    Started
> overcloud-controller-0
>  stonith-fence_ipmilan-002590a2d2c7    (stonith:fence_ipmilan):    Started
> overcloud-controller-1
>  stonith-fence_ipmilan-002590a1c641    (stonith:fence_ipmilan):    Started
> overcloud-controller-2
>  stonith-fence_ipmilan-002590f25822    (stonith:fence_ipmilan):    Started
> overcloud-controller-0
>  stonith-fence_ipmilan-002590f3977a    (stonith:fence_ipmilan):    Started
> overcloud-controller-2
>  stonith-fence_ipmilan-002590f2631a    (stonith:fence_ipmilan):    Started
> overcloud-controller-1
> 
> Notice the stonith-fence_ipmilan lines showed incorrect hosts for the last
> two devices. The MAC addresses are for overcloud-novacompute-0 and
> overcloud-novacompute-1, but the devices got started on the controller
> nodes. Is this right?

That is correct. Stonith resources only run on full cluster nodes (the
controllers), never on pacemaker remote nodes (the computes).
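If you want to double-check that, pacemaker can tell you where a given
stonith device is currently running. Something like the following should
work (the resource name is copied from your status output; both commands
need to be run on one of the cluster nodes, so treat this as a sketch):

```shell
# List the configured stonith devices and their state
pcs stonith show
# Ask pacemaker which node currently hosts a given device
crm_resource --locate --resource stonith-fence_ipmilan-002590f2631a
```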

> There are also some failed actions from the status output:
> 
> Failed Actions:
> * overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error'
> (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3
> 03:48:55 2018', queued=0ms, exec=0ms
> * overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error'
> (1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3
> 14:50:25 2018', queued=0ms, exec=0ms
> * overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error'
> (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3
> 03:47:51 2018', queued=0ms, exec=0ms

Are these from the fresh deployment, or did they appear after you
triggered a crash on a compute node?

> I can spin up VMs, but cannot do failover. If I manually trigger a crash on
> one of the compute nodes, the affected VMs will remain at ERROR state and
> the affected compute node will be unable to rejoin the cluster afterward.
> 
> After a manual reboot on the affected compute node, it cannot start the pcs
> cluster service. Its 'nova_compute' container also remains unhealthy after
> the reboot, with the latest 'docker logs' messages being:
> 
> ++ cat /run_command
> + CMD='/var/lib/nova/instanceha/check-run-nova-compute '
> + ARGS=
> + [[ ! -n '' ]]
> + . kolla_extend_start
> ++ [[ ! -d /var/log/kolla/nova ]]
> +++ stat -c %a /var/log/kolla/nova
> ++ [[ 2755 != \7\5\5 ]]
> ++ chmod 755 /var/log/kolla/nova
> ++ . /usr/local/bin/kolla_nova_extend_start
> +++ [[ ! -d /var/lib/nova/instances ]]
> + echo 'Running command:
> '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
> Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
> + exec /var/lib/nova/instanceha/check-run-nova-compute
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> ...
> 
> So I guess something may be wrong with fencing, but I have no idea what
> caused it or how to fix it. Any help/suggestions/opinions would be
> greatly appreciated. Thank you very much.

So when a compute node crashes, one of the things that happens is that
its nova-compute service gets forcefully marked as down. You should be
able to unblock it manually: run 'nova service-list' to find the uuid of
the forced-down service, then 'nova service-force-down --unset <uuid>'.
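Roughly like this, assuming your overcloudrc is sourced (the --binary
filter just narrows the listing to nova-compute services):

```shell
# List nova-compute services and check the "Forced down" column
nova service-list --binary nova-compute
# Clear the forced_down flag, using the Id from the listing above
nova service-force-down --unset <service-uuid>
```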

I am not sure I understand the exact picture of the problem. Is it:
A) After I crash a compute node the VMs do not get resurrected on
another compute node?
B) The compute node I just crashed hangs at boot with the nova container
waiting for fence-down flag to be cleared?

Is only B) the issue or also A)?

For B), can you try the following? (reconnect_interval is how long
pacemaker waits before trying to reconnect to a failed remote node, so
this gives the compute node more time to come back after fencing; the
cleanup clears the failed actions.)
"""
pcs resource update overcloud-novacompute-0 meta reconnect_interval=180s
pcs resource update overcloud-novacompute-1 meta reconnect_interval=180s
pcs resource cleanup --all
"""
then retry the crash test and report back?
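For context on the 'Waiting for fence-down flag to be cleared' messages:
check-run-nova-compute is essentially a gate that polls until the
fence-down marker is cleared before starting nova-compute. A simplified
local sketch of that loop (a plain file stands in for the real pacemaker
attribute, and the names are purely illustrative):

```shell
# Simplified illustration only: the real check-run-nova-compute polls a
# pacemaker attribute; here a plain file stands in for the flag.
FLAG=/tmp/fence-down-flag
touch "$FLAG"                    # pretend the node was just fenced
( sleep 1; rm -f "$FLAG" ) &     # something clears the flag later
while [ -e "$FLAG" ]; do
    echo "Waiting for fence-down flag to be cleared"
    sleep 0.2
done
wait                             # reap the background job
echo "Flag cleared, would start nova-compute now"
```

Until whatever is supposed to clear the flag actually runs, the loop
spins forever, which is exactly what you are seeing in the container log.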

cheers,
Michele
-- 
Michele Baldessari            <michele at acksyn.org>
C2A5 9DA3 9961 4FFB E01B  D0BC DDD4 DCCB 7515 5C6D

