[rdo-users] Problems with Controller and Instance HA

Cody codeology.lab at gmail.com
Wed Oct 3 22:32:04 UTC 2018


A follow-up note on this:

I passed docker-ha.yaml and compute-instanceha.yaml, along with
fencing.yaml, to the deployment CLI at deploy time. I hope this is the
correct way to achieve Controller HA and Instance HA in a single
deployment, but evidently something is wrong here with the compute
fencing and unfence resources. Any help would be greatly appreciated.
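
For reference, the deploy invocation had roughly this shape (a sketch
from memory; the tripleo-heat-templates paths may differ per release,
and ~/templates/fencing.yaml is a placeholder for my fencing
parameters file):

  openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/compute-instanceha.yaml \
    -e ~/templates/fencing.yaml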

Thank you,
Cody

On Wed, Oct 3, 2018 at 11:46 AM Cody <codeology.lab at gmail.com> wrote:

> Hi everyone,
>
> My cluster is deployed with both Controller HA and Instance HA. The
> deployment completed without errors, but I noticed something strange
> in the 'pcs status' output on the controllers:
>
>  Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
>      Started: [ overcloud-novacompute-0 ]
>      Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-novacompute-1 ]
>  nova-evacuate    (ocf::openstack:NovaEvacuate):    Started overcloud-controller-0
>  stonith-fence_ipmilan-002590a2d2c7    (stonith:fence_ipmilan):    Started overcloud-controller-1
>  stonith-fence_ipmilan-002590a1c641    (stonith:fence_ipmilan):    Started overcloud-controller-2
>  stonith-fence_ipmilan-002590f25822    (stonith:fence_ipmilan):    Started overcloud-controller-0
>  stonith-fence_ipmilan-002590f3977a    (stonith:fence_ipmilan):    Started overcloud-controller-2
>  stonith-fence_ipmilan-002590f2631a    (stonith:fence_ipmilan):    Started overcloud-controller-1
>
> Notice that the stonith-fence_ipmilan lines show what appear to be
> incorrect hosts for the last two devices. Their MAC addresses belong
> to overcloud-novacompute-0 and overcloud-novacompute-1, yet the
> devices are started on the controller nodes. Is this right?
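>
> (For what it's worth, the host mapping configured on each stonith
> device can be inspected from a controller with something like the
> following; the device name is just one taken from the status output
> above:
>
>   pcs stonith show stonith-fence_ipmilan-002590f2631a
>
> which should print, among other options, that device's pcmk_host_list
> and IPMI address.)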
>
> There are also some failed actions from the status output:
>
> Failed Actions:
> * overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 03:48:55 2018', queued=0ms, exec=0ms
> * overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error' (1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 14:50:25 2018', queued=0ms, exec=0ms
> * overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct  3 03:47:51 2018', queued=0ms, exec=0ms
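>
> These start failures appear to be for the pacemaker remote connection
> to the compute node. A basic sanity check (a sketch, assuming the
> standard pacemaker_remote setup on the computes) would be:
>
>   # on overcloud-novacompute-1
>   sudo systemctl status pacemaker_remote
>
>   # from a controller; pacemaker_remote listens on TCP 3121
>   timeout 2 bash -c '</dev/tcp/overcloud-novacompute-1/3121' && echo reachable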
>
> I can spin up VMs, but failover does not work. If I manually trigger
> a crash on one of the compute nodes, the affected VMs remain in ERROR
> state and the affected compute node is unable to rejoin the cluster
> afterward.
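>
> (For concreteness, by "trigger a crash" I mean something like the
> standard sysrq method, run on a compute node with sysrq enabled:
>
>   echo c | sudo tee /proc/sysrq-trigger
>
> which should get the node fenced and its VMs evacuated.)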
>
> After a manual reboot, the affected compute node cannot start the pcs
> cluster service. Its 'nova_compute' container also remains unhealthy,
> with the latest 'docker logs' messages being:
>
> ++ cat /run_command
> + CMD='/var/lib/nova/instanceha/check-run-nova-compute '
> + ARGS=
> + [[ ! -n '' ]]
> + . kolla_extend_start
> ++ [[ ! -d /var/log/kolla/nova ]]
> +++ stat -c %a /var/log/kolla/nova
> ++ [[ 2755 != \7\5\5 ]]
> ++ chmod 755 /var/log/kolla/nova
> ++ . /usr/local/bin/kolla_nova_extend_start
> +++ [[ ! -d /var/lib/nova/instances ]]
> + echo 'Running command: '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
> Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
> + exec /var/lib/nova/instanceha/check-run-nova-compute
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> ...
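>
> My understanding (an assumption from reading the Instance HA agents,
> not something I have confirmed) is that this "fence-down flag" is the
> per-node 'evacuate' attribute, which the nova-evacuate resource is
> supposed to clear once evacuation completes. It can be queried from a
> controller with something like:
>
>   attrd_updater --query --name evacuate --node overcloud-novacompute-1
>
> If the evacuation itself never succeeds (the VMs stay in ERROR), the
> flag would never be cleared, which would explain why nova_compute
> keeps waiting.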
>
> So I suspect something is wrong with fencing, but I have no idea what
> caused it or how to fix it. Any help/suggestions/opinions would be
> greatly appreciated. Thank you very much.
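>
> (A recovery step I have been considering: the failed remote start
> actions can presumably be cleared from a controller with something
> like
>
>   pcs resource cleanup overcloud-novacompute-1
>
> but that alone would not address whatever is wrong with the fencing
> setup itself.)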
>
>
> Regards.
> Cody
>
>