[Rdo-list] New deployment model for HA compute nodes - now with automated recovery of VMs

Arkady_Kanevsky at dell.com
Wed Apr 8 22:57:23 UTC 2015



Thanks Andrew.
This is nice. 

-----Original Message-----
From: Andrew Beekhof [mailto:abeekhof at redhat.com] 
Sent: Wednesday, April 08, 2015 5:54 PM
To: Kanevsky, Arkady
Cc: rdo-list at redhat.com; milind.manjrekar at redhat.com; pmyers at redhat.com; mgarciam at redhat.com; bjayavel at redhat.com
Subject: Re: [Rdo-list] New deployment model for HA compute nodes - now with automated recovery of VMs


> On 9 Apr 2015, at 8:38 am, Arkady_Kanevsky at DELL.com wrote:
> 
> 
> Andrew,
> Say you have a 3-controller-node cluster where all services but compute and swift run.
> HAProxy and pacemaker are configured on that controller cluster, with each service configured under HAProxy and pacemaker.
> Now you are trying to define another pacemaker cluster that includes the original controller cluster plus all compute nodes.

It's not another cluster; machines cannot be part of multiple clusters. The compute nodes are being added to the existing cluster.

> If you can put all nodes into one cluster and then define that a service runs on a subset of its nodes, then it would work.

Correct, there are rules in place to ensure services only run in the "correct" subset.
See https://github.com/beekhof/osp-ha-deploy/blob/master/pcmk/compute-managed.scenario#L123 and the comment above it, as well as all the "pcs constraint location" entries with "osprole eq compute".
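
For anyone curious, the pattern is roughly the following (the resource name and node attribute follow the linked scenario; treat this as an illustrative sketch rather than copy-paste configuration):

  # tag a node with the role it should carry (a node attribute, not a cluster property)
  pcs property set --node compute-1 osprole=compute

  # restrict compute-only services to nodes carrying that attribute
  pcs constraint location nova-compute-clone rule resource-discovery=exclusive score=0 osprole eq compute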

> Integration and deployment tooling can be handled if that works.
> Thanks,
> Arkady
> 
> -----Original Message-----
> From: Andrew Beekhof [mailto:abeekhof at redhat.com] 
> Sent: Wednesday, April 08, 2015 4:52 PM
> To: Kanevsky, Arkady
> Cc: rdo-list at redhat.com; milind.manjrekar at redhat.com; Perry Myers; Marcos Garcia; Balaji Jayavelu
> Subject: Re: [Rdo-list] New deployment model for HA compute nodes - now with automated recovery of VMs
> 
> 
>> On 9 Apr 2015, at 6:52 am, Arkady_Kanevsky at DELL.com wrote:
>> 
>> 
>> Does that work with an HA controller cluster where the non-remote pacemaker stack runs?
> 
> I'm not sure I understand the question.
> The compute and control nodes are all part of a single cluster, it's just that the compute nodes are not running a full stack.
> 
> Or do you mean, "could the same approach work for control nodes"?
> For example, could this be used to manage more than 16 swift ACO nodes... 
> 
> Short answer: yes
> Longer answer: yes, but additional integration work would likely be required, so don't expect it in a hurry
> 
> That specific case is on my mental list of options to explore in the future.
> 
> -- Andrew
> 
>> 
>> -----Original Message-----
>> From: rdo-list-bounces at redhat.com [mailto:rdo-list-bounces at redhat.com] On Behalf Of Andrew Beekhof
>> Sent: Tuesday, April 07, 2015 9:13 PM
>> To: rdo-list at redhat.com; rhos-pgm
>> Cc: milind.manjrekar at redhat.com; Perry Myers; Marcos Garcia; Balaji Jayavelu
>> Subject: [Rdo-list] New deployment model for HA compute nodes - now with automated recovery of VMs
>> 
>> Previously, in order to monitor the health of compute nodes and the services running on them, we had to create single-node clusters due to corosync's scaling limits.
>> We can now announce a new deployment model that allows Pacemaker to continue this role, but presents a single coherent view of the entire deployment while allowing us to scale beyond corosync's limits.
>> 
>> Having this single administrative domain then allows us to do clever things like automated recovery of VMs running on a failed or failing compute node.
>> 
>> The main difference from the previous deployment model is that services on the compute nodes are now managed and driven by the Pacemaker cluster on the control plane.
>> The compute nodes do not become full members of the cluster and no longer require the full cluster stack; instead they run pacemaker_remoted, which acts as a conduit. 
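>> 
>> Roughly speaking, each compute node is represented on the control plane as a remote-node connection resource. A minimal sketch (the node name and timings are only examples):
>> 
>>   # connection resource for the pacemaker_remoted daemon running on compute-1
>>   pcs resource create compute-1 ocf:pacemaker:remote reconnect_interval=60 op monitor interval=20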
>> 
>> Implementation Details:
>> 
>> - Pacemaker monitors the connection to pacemaker_remoted to verify whether the node is reachable. 
>> Failure to talk to a node triggers a recovery action.
>> 
>> - Pacemaker uses pacemaker_remoted to start the compute node services in the same sequence as before (neutron-ovs-agent -> ceilometer-compute -> nova-compute); a sketch of the ordering constraints follows this list.
>> 
>> - If a service fails to start, any services that depend on the FAILED service will not be started.
>> This avoids the issue of adding a broken node (back) to the pool.
>> 
>> - If a service fails to stop, the node where the service is running will be fenced. 
>> This is necessary to guarantee data integrity and is a core HA concept (for the purposes of this particular discussion, please take this as a given).
>> 
>> - If a service's health check fails, the resource (and anything that depends on it) will be stopped and then restarted.
>> Remember that failure to stop will trigger a fencing action.
>> 
>> - A successful restart of all the services can, at worst, affect the network connectivity of the instances for a short period of time.
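>> 
>> To give a flavour of the start ordering mentioned above, the constraints look something like the following (the clone names are illustrative, not necessarily the exact ones from the scenario):
>> 
>>   # compute services start in order: neutron agent -> ceilometer -> nova
>>   pcs constraint order start neutron-openvswitch-agent-compute-clone then ceilometer-compute-clone
>>   pcs constraint order start ceilometer-compute-clone then nova-compute-clone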
>> 
>> With these capabilities in place, we can exploit Pacemaker's node monitoring and fencing capabilities to drive nova host-evacuate for the failed compute nodes and recover the VMs elsewhere.
>> When a compute node fails, Pacemaker will:
>> 
>> 1. Execute 'nova service-disable'
>> 2. Fence (power off) the failed compute node
>> 3. fence_compute off (waiting for nova to detect the compute node is gone)
>> 4. fence_compute on (a no-op unless the host happens to be up already)
>> 5. Execute 'nova service-enable' when the compute node returns
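>> 
>> For reference, steps 1 and 5 boil down to the standard novaclient calls (the hostname is only an example):
>> 
>>   nova service-disable compute-1.example.com nova-compute
>>   nova service-enable compute-1.example.com nova-compute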
>> 
>> Technically, steps 1 and 5 are optional; they are aimed at improving the user experience by immediately excluding a failed host from nova scheduling. 
>> The only benefit is faster scheduling of VMs during a failure (nova does not have to recognize that a host is down, time out, and subsequently schedule the VM on another host).
>> 
>> Step 2 will make sure the host is completely powered off and nothing is running on the host.
>> Optionally, you can have the failed host reboot, which would potentially allow it to re-enter the pool.
>> 
>> We have an implementation for Step 3 but the ideal solution depends on extensions to the nova API.
>> Currently fence_compute loops, waiting for nova to recognise that the failed host is down, before we make a host-evacuate call, which triggers nova to restart the VMs on another host.
>> The discussed nova API extensions will speed up recovery times by allowing fence_compute to proactively push that information into nova instead.
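>> 
>> Once nova agrees that the host is down, the recovery itself is essentially equivalent to running (the failed host name is only an example):
>> 
>>   nova host-evacuate --on-shared-storage compute-1.example.com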
>> 
>> 
>> To take advantage of the VM recovery features:
>> 
>> - VMs need to be running off a cinder volume or using shared ephemeral storage (like RBD or NFS)
>> - If a VM is not running on shared storage, recovery of the instance on a new compute node would need to revert to a previously stored snapshot/image in Glance (potentially losing state, but in some cases that may not matter)
>> - RHEL 7.1+ is required for the infrastructure nodes (controllers and compute). Instance guests can run anything.
>> - Compute nodes need to have a working fencing mechanism (IPMI, hardware watchdog, etc.); see the example below
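>> 
>> For instance, a typical IPMI-based stonith device might be created with something like this (all values below are placeholders):
>> 
>>   # fencing device for compute-1 via its IPMI/BMC interface
>>   pcs stonith create ipmi-compute-1 fence_ipmilan pcmk_host_list=compute-1 ipaddr=10.0.0.101 login=admin passwd=secret lanplus=1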
>> 
>> 
>> Detailed instructions for deploying this new model are of course available on Github:
>> 
>>   https://github.com/beekhof/osp-ha-deploy/blob/master/ha-openstack.md#compute-node-implementation
>> 
>> It has been successfully deployed in our labs, but we'd really like to hear how it works for you in the field.
>> Please contact me if you encounter any issues.
>> 
>> -- Andrew
>> 