<div dir="ltr">Hi Andrew,<div><br></div><div>I've looked at your git repository, great work, but I'm using the native tools with the keepalived approach, and the mmonit utility to monitor the infrastructure, without pacemaker/corosync.</div><div><br></div><div>I'm testing this approach to evacuate and disable a compute node if something fails. Which approach do you consider best, bearing in mind that an external monitoring tool like mmonit is not "cluster aware" and doesn't fence the dead node the way pacemaker does?</div><div><br></div><div>Thank you.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 8, 2015 at 3:12 AM, Andrew Beekhof <span dir="ltr">&lt;<a href="mailto:abeekhof@redhat.com" target="_blank">abeekhof@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Previously, in order to monitor the health of compute nodes and the services running on them, we had to create single-node clusters due to corosync's scaling limits.<br>

We can now announce a new deployment model that allows Pacemaker to continue this role, but presents a single coherent view of the entire deployment while allowing us to scale beyond corosync's limits.<br>

<br>

Having this single administrative domain then allows us to do clever things like automated recovery of VMs running on a failed or failing compute node.<br>

<br>

The main difference from the previous deployment model is that services on the compute nodes are now managed and driven by the Pacemaker cluster on the control plane.<br>

The compute nodes do not become full members of the cluster and they no longer require the full cluster stack; instead they run pacemaker_remoted, which acts as a conduit.<br>

<br>

Implementation Details:<br>

<br>

- Pacemaker monitors the connection to pacemaker_remoted to verify whether the node is reachable.<br>

  Failure to talk to a node triggers recovery action.<br>

<br>

- Pacemaker uses pacemaker_remoted to start compute node services in the same sequence as before (neutron-ovs-agent -> ceilometer-compute -> nova-compute).<br>

<br>

- If a service fails to start, any services that depend on the FAILED service will not be started.<br>

  This avoids the issue of adding a broken node (back) to the pool.<br>

<br>

- If a service fails to stop, the node where the service is running will be fenced.<br>

  This is necessary to guarantee data integrity and is a core HA concept (for the purposes of this particular discussion, please take this as a given).<br>

<br>

- If a service's health check fails, the resource (and anything that depends on it) will be stopped and then restarted.<br>

  Remember that failure to stop will trigger a fencing action.<br>

<br>

- A successful restart of all the services can, at worst, affect the network connectivity of the instances for a short period of time.<br>
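<br>

As a rough illustration of the rules listed above (not Pacemaker's actual implementation), the start ordering and the per-operation failure handling could be sketched as follows. The `start` callable and the action labels are hypothetical names introduced for this example only.<br>

```python
from typing import Callable, List

# Start order taken from the list above.
START_ORDER = ["neutron-ovs-agent", "ceilometer-compute", "nova-compute"]


def start_in_order(start: Callable[[str], bool],
                   order: List[str] = START_ORDER) -> List[str]:
    """Start services one by one and stop at the first failure.

    `start` is a hypothetical callable that returns True on success.
    Returns the names of the services that were started.
    """
    started = []
    for name in order:
        if not start(name):
            break  # services that depend on the failed one are never attempted
        started.append(name)
    return started


def action_on_failure(operation: str) -> str:
    """Map a failed operation to the reaction described in the list above.

    The returned strings are illustrative labels, not Pacemaker settings.
    """
    actions = {
        "start": "leave dependent services stopped",
        "stop": "fence the node",
        "monitor": "stop and restart the resource and its dependents",
    }
    return actions[operation]
```

For example, if ceilometer-compute fails to start, only neutron-ovs-agent ends up running and nova-compute is never attempted, so the node is not offered back to the pool.<br>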

<br>

With these capabilities in place, we can exploit Pacemaker's node monitoring and fencing capabilities to drive nova host-evacuate for the failed compute nodes and recover the VMs elsewhere.<br>

When a compute node fails, Pacemaker will:<br>

<br>

1. Execute 'nova service-disable'<br>

2. fence (power off) the failed compute node<br>

3. fence_compute off (waiting for nova to detect the compute node is gone)<br>

4. fence_compute on (a no-op unless the host happens to be up already)<br>

5. Execute 'nova service-enable' when the compute node returns<br>
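<br>

The ordering of these five steps can be sketched as below. This is only an illustration of the sequence, assuming a hypothetical `run(step, host)` callable that performs each step; the step strings are labels copied from the list, not exact command lines.<br>

```python
from typing import Callable, List

# Steps 1-4 from the list above, in order. The strings are labels only;
# the real work is done by the `run` callable supplied by the caller.
FAILURE_STEPS = [
    "nova service-disable",
    "fence (power off)",
    "fence_compute off",
    "fence_compute on",
]


def on_node_failure(host: str, run: Callable[[str, str], None]) -> List[str]:
    """Run steps 1-4 in order for a failed compute node.

    An exception raised by `run` aborts the remaining steps.
    Returns the labels of the steps that completed.
    """
    completed = []
    for step in FAILURE_STEPS:
        run(step, host)
        completed.append(step)
    return completed


def on_node_return(host: str, run: Callable[[str, str], None]) -> None:
    """Step 5: re-enable the service once the compute node is back."""
    run("nova service-enable", host)
```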

<br>

Technically steps 1 and 5 are optional; they aim to improve the user experience by immediately excluding a failed host from nova scheduling.<br>

The only benefit is faster scheduling of VMs during a failure (nova does not have to recognize that a host is down, time out, and subsequently schedule the VM on another host).<br>

<br>

Step 2 will make sure the host is completely powered off and nothing is running on the host.<br>

Optionally, you can have the failed host reboot, which would potentially allow it to re-enter the pool.<br>

<br>

We have an implementation for Step 3 but the ideal solution depends on extensions to the nova API.<br>

Currently fence_compute loops, waiting for nova to recognise that the failed host is down, before we make a host-evacuate call which triggers nova to restart the VMs on another host.<br>

The discussed nova API extensions will speed up recovery times by allowing fence_compute to proactively push that information into nova instead.<br>
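<br>

The current wait-then-evacuate behaviour described above can be sketched as a simple polling loop. This is a minimal sketch and not the fence_compute source: `host_is_down` and `evacuate` are hypothetical callables standing in for the nova queries and the host-evacuate call, and the interval and timeout values are arbitrary.<br>

```python
import time
from typing import Callable


def wait_then_evacuate(host: str,
                       host_is_down: Callable[[str], bool],
                       evacuate: Callable[[str], None],
                       poll_interval: float = 1.0,
                       timeout: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until nova reports the host as down, then trigger evacuation.

    Returns True if evacuation was triggered, False if the timeout expired
    first. `sleep` and `clock` are injectable so the loop can be tested
    without real delays.
    """
    deadline = clock() + timeout
    while not host_is_down(host):
        if clock() >= deadline:
            return False  # nova never noticed the failure in time
        sleep(poll_interval)
    evacuate(host)
    return True
```

With the proposed API extensions, the polling loop would become unnecessary because the "host is down" fact would be pushed into nova directly.<br>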

<br>

<br>

To take advantage of the VM recovery features:<br>

<br>

- VMs need to be running off a cinder volume or using shared ephemeral storage (like RBD or NFS)<br>

- If a VM is not running on shared storage, recovery of the instance on a new compute node would need to revert to a previously stored snapshot/image in Glance (potentially losing state, but in some cases that may not matter)<br>

- RHEL7.1+ required for infrastructure nodes (controllers and compute). Instance guests can run anything.<br>

- Compute nodes need to have a working fencing mechanism (IPMI, hardware watchdog, etc.)<br>
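<br>

The storage requirement above amounts to a small decision, sketched here. The function name, parameters, and return labels are hypothetical and exist only to illustrate the rule.<br>

```python
def recovery_mode(boots_from_cinder_volume: bool,
                  uses_shared_ephemeral_storage: bool) -> str:
    """Return how a VM from a failed compute node can be recovered.

    Disk state survives only when the disk lives off the failed host,
    i.e. on a cinder volume or on shared ephemeral storage (RBD, NFS).
    """
    if boots_from_cinder_volume or uses_shared_ephemeral_storage:
        return "restart on another host with disk state preserved"
    return "revert to a stored snapshot/image in Glance"
```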

<br>

<br>

Detailed instructions for deploying this new model are of course available on GitHub:<br>

<br>

    <a href="https://github.com/beekhof/osp-ha-deploy/blob/master/ha-openstack.md#compute-node-implementation" target="_blank">https://github.com/beekhof/osp-ha-deploy/blob/master/ha-openstack.md#compute-node-implementation</a><br>

<br>

It has been successfully deployed in our labs, but we'd really like to hear how it works for you in the field.<br>

Please contact me if you encounter any issues.<br>

<br>

-- Andrew<br>

<br>


</blockquote></div><br></div>