What is Failover Clustering?

by Jacqueline Emigh

Learn how failover clusters work and why this technology can either eliminate or minimize system downtime for server-based software applications and services.

A failover cluster is a set of computer servers that work together to provide either high availability (HA) or continuous availability (CA). If one of the servers goes down, another node in the cluster can assume its workload with either minimum or no downtime through a process referred to as failover.

Some failover clusters use physical servers only, whereas others involve virtual machines (VMs).

The main purpose of a failover cluster is to provide either CA or HA for applications and services. Also referred to as fault tolerant (FT) clusters, CA clusters allow end users to keep utilizing applications and services without experiencing any timeouts if a server fails. With HA clusters, on the other hand, a user might undergo a brief interruption in service, but the system will recover automatically with no data loss and minimum downtime.

A cluster is made up of two or more nodes, or servers, which are generally connected through physical cables in addition to software. Other kinds of clustering technology can be used for purposes such as load balancing, storage, and concurrent or parallel processing. Some implementations combine failover clusters with additional clustering technology.

To protect your data, a dedicated network connects the failover cluster nodes, providing essential CA or HA backup.

How Failover Clusters Work

While CA failover clusters are designed for 100 percent availability, HA clusters aim for 99.999 percent availability, also known as “five nines,” which allows no more than 5.26 minutes of downtime per year. As a trade-off for their greater availability, though, CA clusters are more costly to implement because of their increased hardware requirements.
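The “five nines” figure follows from simple arithmetic. As a minimal sketch (the function name downtime_minutes_per_year is made up for illustration, and a 365-day year is assumed), the yearly downtime allowed by a given availability percentage can be computed like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime, in minutes per year, allowed at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


for percent in (99.9, 99.99, 99.999):
    print(f"{percent}% -> {downtime_minutes_per_year(percent):.2f} minutes/year")
```

At 99.999 percent the result rounds to the 5.26 minutes quoted above. Each additional nine cuts the allowed downtime by a factor of ten.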

High Availability Failover Clusters

In a high availability cluster, groups of independent servers are loosely coupled to share resources and data throughout the system. All nodes in a failover cluster have access to shared storage. High availability clusters also include a monitoring connection which servers use to check the “heartbeat” or health of one another. At least one of the nodes in a cluster is active, while at least one is passive.

In a simple two-node configuration, for example, if Node 1 fails, Node 2 uses the heartbeat connection to recognize the failure and then configures itself as the active node. Clustering software installed on every node in the cluster makes sure that clients connect to an active node.
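The two-node case can be sketched in a few lines. This is only an illustration of the idea, not how any particular clustering product works: the Node class, the check_peer function, and the three-second timeout are all assumptions, and timestamps are passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    active: bool = False
    last_heartbeat: float = 0.0  # time the node last sent a heartbeat, in seconds


HEARTBEAT_TIMEOUT = 3.0  # hypothetical: seconds of silence before a peer is presumed failed


def check_peer(standby: Node, peer: Node, now: float) -> bool:
    """Promote `standby` to active if the active `peer` has gone silent.

    Returns True if a failover happened.
    """
    if peer.active and now - peer.last_heartbeat > HEARTBEAT_TIMEOUT:
        peer.active = False
        standby.active = True
        return True
    return False


node1 = Node("Node 1", active=True, last_heartbeat=10.0)
node2 = Node("Node 2")

print(check_peer(node2, node1, now=11.0))  # heartbeat is recent, so no failover
print(check_peer(node2, node1, now=20.0))  # heartbeat is stale, so Node 2 takes over
print(node2.active)
```

In a real cluster the standby node would run a check like this on a timer, and clients would then be directed to whichever node is marked active.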

In larger configurations, cluster management can be performed by dedicated servers. A cluster management server constantly sends out heartbeat signals to determine if any of the nodes is failing, and if so, to direct another node to assume the load.

Some cluster management software provides HA for virtual machines (VMs) by pooling them and the physical servers they reside on into a cluster. If failure occurs, the VMs on the failed host are restarted on alternate hosts.
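As a rough sketch of that restart step, the function below moves the VMs from a failed host onto the surviving hosts. The restart_vms name, the host and VM labels, and the "fewest VMs first" placement rule are assumptions made for illustration. Real cluster managers weigh CPU, memory, and other constraints.

```python
def restart_vms(placement: dict[str, list[str]], failed_host: str) -> dict[str, list[str]]:
    """Return a new placement with the failed host's VMs moved to surviving hosts.

    Each VM goes to whichever surviving host currently runs the fewest VMs.
    """
    survivors = {host: list(vms) for host, vms in placement.items() if host != failed_host}
    if not survivors:
        raise RuntimeError("no surviving hosts available")
    for vm in placement.get(failed_host, []):
        target = min(survivors, key=lambda host: len(survivors[host]))
        survivors[target].append(vm)
    return survivors


placement = {
    "host-a": ["vm1", "vm2"],
    "host-b": ["vm3"],
    "host-c": [],
}
print(restart_vms(placement, "host-a"))
```

The original placement is left untouched, which makes it easy to compare the cluster's state before and after the failure.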

Shared storage does pose a risk as a potential single point of failure. However, the use of RAID 6 together with RAID 10 can help to ensure that service will continue even if two hard drives fail.

If all servers are plugged into the same power grid, electrical power can represent another single point of failure. Yet the nodes can be safeguarded by equipping each with a separate uninterruptible power supply (UPS).

Continuous Availability Failover Clusters

In contrast, a fault-tolerant cluster consists of multiple systems, which share a single copy of a computer's OS. Software commands issued by one system are also executed on the other system. CA can only be achieved by using a continuously available and nearly exact copy of a physical or virtual machine running the service. This redundancy model is called 2N.

CA requires the organization to use formatted computer equipment, plus a secondary UPS. CA systems can also compensate for many different sorts of failures.

A fault tolerant system can automatically detect a failure of not just a hard drive but also a central processing unit (CPU), I/O subsystem, power supply, or network component, for instance. The failure point can be immediately identified, and a backup component or procedure can take its place instantly without interruption in service.

In a CA failover cluster, the operating system (OS) is outfitted with an interface permitting a software programmer to do checkpoints of critical data at predetermined points in a transaction.
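The checkpoint idea can be illustrated with a small in-memory sketch. The article does not describe the actual OS interface, so the CheckpointedState class and its checkpoint and restore methods are hypothetical names chosen for this example only.

```python
import copy


class CheckpointedState:
    """Hold critical data and allow rolling back to the last checkpoint."""

    def __init__(self, data: dict):
        self.data = data
        self._saved = copy.deepcopy(data)

    def checkpoint(self) -> None:
        """Record the current data as the last known-good state."""
        self._saved = copy.deepcopy(self.data)

    def restore(self) -> None:
        """Discard any changes made since the last checkpoint."""
        self.data = copy.deepcopy(self._saved)


state = CheckpointedState({"balance": 100})
state.data["balance"] -= 30   # step 1 of a transaction completes
state.checkpoint()            # predetermined checkpoint after step 1
state.data["balance"] -= 500  # step 2 is interrupted by a failure
state.restore()               # the backup resumes from the checkpoint
print(state.data)
```

After the restore, the data reflects step 1 but not the interrupted step 2, which is the state a backup system would pick up from.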

Clustering software can also be used to group together two or more servers to act as a single virtual server. You can also create many other CA failover setups. For example, a cluster might be configured so that if one of the virtual servers fails, the others respond by temporarily removing the failed server from the cluster. The cluster then automatically redistributes the workload among the remaining servers until the downed server is ready to go online again.
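A minimal sketch of that remove-and-redistribute behavior follows. The VirtualServerPool class, its method names, and the round-robin routing rule are assumptions for illustration. The article does not say how any specific product distributes the load.

```python
class VirtualServerPool:
    """Spread requests round-robin over the members that are currently online."""

    def __init__(self, members: list[str]):
        self.online = list(members)
        self._next = 0

    def mark_failed(self, member: str) -> None:
        """Temporarily remove a failed member from the rotation."""
        if member in self.online:
            self.online.remove(member)

    def mark_recovered(self, member: str) -> None:
        """Return a repaired member to the rotation."""
        if member not in self.online:
            self.online.append(member)

    def route(self) -> str:
        """Pick the member that should handle the next request."""
        if not self.online:
            raise RuntimeError("no members online")
        member = self.online[self._next % len(self.online)]
        self._next += 1
        return member


pool = VirtualServerPool(["s1", "s2", "s3"])
pool.mark_failed("s2")
print([pool.route() for _ in range(4)])  # requests alternate between s1 and s3
pool.mark_recovered("s2")
print(sorted(pool.online))
```

While s2 is marked as failed, every request lands on s1 or s3. Once it is marked as recovered, it rejoins the rotation.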

An alternative to CA failover clusters is use of a “double” hardware server in which all physical components are duplicated. Calculations are done independently and simultaneously on the same hardware system. Yet this option can be even more expensive.

These “double” hardware systems perform synchronization by using a dedicated node that keeps tabs on the results coming from both physical servers. Stratus, a maker of these specialized fault tolerant hardware servers, promises that system downtime won’t amount to more than 32 seconds each year. However, the cost of one Stratus server with dual CPUs is estimated at approximately $160,000 per synchronized module.

Practical Applications of Failover Clusters

Ongoing Availability of Mission Critical Applications

Fault tolerant systems are a necessity for computers used in online transaction processing (OLTP) systems. OLTP, which demands 100 percent availability, is used in airline reservations systems, electronic stock trading, and ATM banking, for example.

Many other types of organizations also use either CA clusters or fault tolerant computers for mission critical applications, such as businesses in the fields of manufacturing, logistics, and retailing. Applications include e-commerce, order management, and employee time clock systems, for example.

For clustering applications and services requiring only “five nines” uptime, though, high availability clusters are generally regarded as adequate.

Disaster Recovery

Disaster recovery is another practical application for failover clusters. Of course, it’s highly advisable for failover servers to be housed at remote sites in the event that a disaster such as a fire or flood takes out all physical hardware and software in the primary data center.

In Windows Server 2016 and 2019, for example, Microsoft provides Storage Replica, a technology allowing replication of volumes between servers for disaster recovery. The technology includes a “stretch failover” feature for failover clusters spanning two geographic sites.

By stretching failover clusters, organizations can replicate among multiple data centers. If a disaster strikes at one location, all data continues to exist on failover servers at other sites.

Database Replication

According to Microsoft, the company originally introduced Windows Server Failover Cluster (WSFC), a feature that predates Windows Server 2016, mainly to protect “mission-critical” applications such as its SQL Server database and Microsoft Exchange communications server.

Other database providers, too, offer failover cluster technology for database replication. MySQL Cluster, for example, includes a heartbeat mechanism that detects failures almost instantly and fails over to other nodes in the cluster, typically within one second, with no service interruptions to clients. A geographic replication feature enables databases to be mirrored to remote locations.

Failover Cluster Types

VMware Failover Clusters

Among the virtualization products available, VMware offers several virtualization tools for VM clusters. vSphere 6 Fault Tolerance provides a CA architecture that exactly replicates a VMware virtual machine on an alternate physical host in case the main host server goes down.

A second product, VMware HA, follows the approach of providing HA for VMs by pooling them and their hosts into a cluster for automatic failover. Using VMware HA in conjunction with VMware’s Distributed Resource Scheduler (DRS) adds load balancing, for faster rebalancing of VMs after VMware HA has moved the VMs to other hosts.

Windows Server Failover Cluster (WSFC)

You can create Hyper-V failover servers with the use of WSFC, a feature in Windows Server 2016 and 2019 that monitors clustered physical servers, providing failover if needed. WSFC also monitors clustered roles, formerly referred to as clustered applications and services. If a clustered role isn’t working correctly, it is either restarted or moved to another node.
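The "restart or move" decision for a clustered role can be sketched as a simple policy. This is not Microsoft's API or its actual policy: the handle_role_failure function, the shape of the role dictionary, and the max_restarts limit are all hypothetical and serve only to illustrate the idea.

```python
def handle_role_failure(role: dict, nodes: list[str], max_restarts: int = 1) -> str:
    """Restart a failed role in place, or move it once restarts are used up.

    `role` is a dict with "name", "node", and "restarts" keys (a made-up shape).
    Returns a short description of the action taken.
    """
    if role["restarts"] < max_restarts:
        role["restarts"] += 1
        return f"restarted {role['name']} on {role['node']}"
    others = [n for n in nodes if n != role["node"]]
    if not others:
        raise RuntimeError("no other node available")
    role["node"] = others[0]
    role["restarts"] = 0  # the restart budget starts over on the new node
    return f"moved {role['name']} to {role['node']}"


role = {"name": "file-server", "node": "node1", "restarts": 0}
nodes = ["node1", "node2"]
print(handle_role_failure(role, nodes))  # first failure: restart in place
print(handle_role_failure(role, nodes))  # second failure: move to another node
```

Trying a restart first keeps the role where it is when the problem is transient. Moving it is the fallback when the node itself appears to be at fault.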

WSFC includes Microsoft’s previous Cluster Shared Volume (CSV) technology to provide a consistent, distributed namespace for accessing shared storage from all nodes. In addition, WSFC supports CA file share storage for SQL Server and Microsoft Hyper-V cluster VMs. It also supports HA roles running on physical servers and Hyper-V cluster VMs.

SQL Server Failover Clusters

In SQL Server 2012, Microsoft introduced Always On, an HA solution that uses WSFC as a platform technology, registering SQL Server components as WSFC cluster resources. According to Microsoft, related resources are combined into a role which is dependent on other WSFC resources. WSFC can then identify and communicate the need to either restart a SQL Server instance or automatically fail it over to a different node.

Red Hat Linux Failover Clusters

OS makers other than Microsoft also provide their own failover cluster technologies. For example, Red Hat Enterprise Linux (RHEL) users can create HA failover clusters with the High Availability Add-On and Red Hat Global File System (GFS/GFS2). Support is provided for single-cluster stretch clusters spanning multiple sites as well as multi-site, or “disaster-tolerant,” clusters. The multi-site clusters generally use storage area network (SAN)-enabled data storage replication.

This article was originally published on Wednesday May 15th 2019