<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">A follow up note on this:<br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">I passed docker-ha.yaml and compute-instanceha.yaml along with fencing.yaml to the CLI at the time of deployment. I hope this is the correct way to achieve Controller HA and Instance HA in a single deployment. But evidently, something is wrong here with the compute fencing and unfence resources. Any helps would be greatly appreciated.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Thank you,<br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Cody<br></div><br><div class="gmail_quote"><div dir="ltr">On Wed, Oct 3, 2018 at 11:46 AM Cody <<a href="mailto:codeology.lab@gmail.com">codeology.lab@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hi everyone,</div><div><br></div><div><div style="font-family:arial,helvetica,sans-serif">My cluster is deployed with both Controller and Instance HA. The deployment completed without errors, but I noticed something strange from the 'pcs status' output from the controllers:</div><div style="font-family:arial,helvetica,sans-serif"><br></div><div style="font-family:arial,helvetica,sans-serif"> Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]<br> Started: [ overcloud-novacompute-0 ]<br> Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-novacompute-1 ]<br> nova-evacuate (ocf::openstack:NovaEvacuate): Started overcloud-controller-0<br> stonith-fence_ipmilan-002590a2d2c7 (stonith:fence_ipmilan): Started overcloud-controller-1<br> stonith-fence_ipmilan-002590a1c641 (stonith:fence_ipmilan): Started overcloud-controller-2<br> stonith-fence_ipmilan-002590f25822 (stonith:fence_ipmilan): Started overcloud-controller-0<br> stonith-fence_ipmilan-002590f3977a (stonith:fence_ipmilan): Started overcloud-controller-2<br> stonith-fence_ipmilan-002590f2631a (stonith:fence_ipmilan): Started overcloud-controller-1</div><div style="font-family:arial,helvetica,sans-serif"></div><div style="font-family:arial,helvetica,sans-serif"><br></div><div style="font-family:arial,helvetica,sans-serif">Notice the stonith-fence_ipmilan lines showed incorrect hosts for the last two devices. The MAC addresses are for the overcloud-novacompute-0 and overcloud-novacompute-1, but it got started on the controller nodes. Is this right? 
There are also some failed actions in the status output:

    Failed Actions:
    * overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 03:48:55 2018', queued=0ms, exec=0ms
    * overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error' (1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 14:50:25 2018', queued=0ms, exec=0ms
    * overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3 03:47:51 2018', queued=0ms, exec=0ms

I can spin up VMs, but failover does not work. If I manually trigger a crash on one of the compute nodes, the affected VMs remain in the ERROR state, and the affected compute node is unable to rejoin the cluster afterward.

After a manual reboot, the affected compute node cannot start the pcs cluster service. Its 'nova_compute' container also remains unhealthy after the reboot, with the latest 'docker logs' output being:

    ++ cat /run_command
    + CMD='/var/lib/nova/instanceha/check-run-nova-compute '
    + ARGS=
    + [[ ! -n '' ]]
    + . kolla_extend_start
    ++ [[ ! -d /var/log/kolla/nova ]]
    +++ stat -c %a /var/log/kolla/nova
    ++ [[ 2755 != \7\5\5 ]]
    ++ chmod 755 /var/log/kolla/nova
    ++ . /usr/local/bin/kolla_nova_extend_start
    +++ [[ ! -d /var/lib/nova/instances ]]
    + echo 'Running command: '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
    Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
    + exec /var/lib/nova/instanceha/check-run-nova-compute
    Waiting for fence-down flag to be cleared
    Waiting for fence-down flag to be cleared
    Waiting for fence-down flag to be cleared
    ...

So I suspect something is wrong with fencing, but I have no idea what caused it or how to fix it.
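For anyone diagnosing this: check-run-nova-compute appears to be waiting on a per-node Pacemaker attribute, which can presumably be queried with something like the sketch below. I am assuming the flag is the 'evacuate' node attribute managed by NovaEvacuate; that is my reading of the wrapper script, not something I have confirmed.

    # Query the attribute from any cluster member; attrd_updater ships
    # with Pacemaker. A value that never clears would match the
    # repeating "Waiting for fence-down flag" log above.
    $ sudo attrd_updater --query --name evacuate --node overcloud-novacompute-1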
Any help, suggestions, or opinions would be greatly appreciated. Thank you very much.

Regards,
Cody