Design and Configuration of vSphere Fault Tolerance
Don't gamble on your infrastructure's resilience. Make sure you have redundancy built in.
vSphere Fault Tolerance has been around since the 4.x days, and has typically been seen as a neat idea but of limited practicality. The problem was that besides the substantial prerequisites, the feature could only protect virtual machines (VMs) with 1 virtual CPU (vCPU). In the real world, machines critical enough to be protected by Fault Tolerance (FT) tend to have more than 1 vCPU. Rumors about VMware adding SMP Fault Tolerance (SMP means symmetric multiprocessing, or "multi-CPU") have been around for a few years now. Finally, with the release of vSphere 6.0 in early 2015, the long-awaited feature became available in a GA release.
Although FT is quite simple to enable, what's a bit more challenging is designing properly for it in the first place. Once the proper design is in place, protecting a VM is a matter of a few clicks. In this walkthrough, I'll discuss design considerations, prerequisites, and finally the configuration of FT on a multi-CPU VM. Before any of the work begins, one must consider the design of the overall environment.
Because of the value of FT in terms of risk mitigation, a certain level of complexity and overhead is tolerable. But there are quite a few things to consider when deciding whether or not to configure and deploy SMP FT.
- The Secondary VM is essentially a powered on, full clone of the machine being protected (the Primary VM). This means that for any machine being protected, the resource requirements double: twice the CPU, twice the memory, and with this iteration of FT, twice the disk space. (I'll address disk space later.)
- Snapshots are not supported with FT. Any existing snapshots must be committed before enabling FT for a VM, and an administrator will not be able to take any snapshots once FT is active.
- Storage vMotion is not possible for VMs leveraging FT.
- FT can't be run on top of VSAN at this time. The same is true for any Virtual Volumes (VVOLs) datastore.
- RDMs, NPIV, CPU/Memory Hot-plug, and VMDKs greater than 2TB are all incompatible with FT.
- FT logging is unencrypted, which means that if it isn't isolated from other network traffic, it's a security risk. It's recommended to deploy a totally isolated network segment for FT logging.
- vMotion and FT logging must be on different subnets. VMware requires a dedicated NIC for each type of traffic, but recommends a minimum of three NICs for availability purposes.
- Hosts running the Primary and Secondary VMs should operate at approximately the same processor frequencies; otherwise, the Secondary VM might be restarted more frequently. Because of this, power management features should be disabled in the BIOS to eliminate the clock speed variance that power saving introduces.
- VMware recommends the use of 10GbE NICs and a network configured to use Jumbo Frames.
- An Enterprise Plus license is required for protecting up to 4 vCPUs. Standard and Enterprise editions protect only up to 2 vCPUs.
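To make the resource-doubling and licensing points concrete, here is a minimal Python sketch. The `VmSize` class, the function names, and the example VM dimensions are all hypothetical and exist only for illustration; the vCPU limits per edition are the ones given in the list above.

```python
from dataclasses import dataclass

# Maximum vCPUs per FT-protected VM by license edition, as stated in the list above.
FT_VCPU_LIMIT = {"standard": 2, "enterprise": 2, "enterprise_plus": 4}


@dataclass
class VmSize:
    """Hypothetical container for the size of a VM."""
    vcpus: int
    memory_gb: float
    disk_gb: float


def ft_footprint(vm: VmSize) -> VmSize:
    """Total resources consumed by the Primary VM plus its Secondary copy."""
    return VmSize(vm.vcpus * 2, vm.memory_gb * 2, vm.disk_gb * 2)


def license_allows(vm: VmSize, edition: str) -> bool:
    """True if the edition's FT vCPU limit covers this VM."""
    return vm.vcpus <= FT_VCPU_LIMIT[edition]


if __name__ == "__main__":
    vm = VmSize(vcpus=4, memory_gb=16, disk_gb=200)  # hypothetical VM
    print(ft_footprint(vm))
    print(license_allows(vm, "enterprise"))
    print(license_allows(vm, "enterprise_plus"))
```

Running the numbers this way during capacity planning makes it obvious how quickly a handful of protected VMs can consume a cluster's headroom.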
If these considerations haven't made FT seem impractical or overwhelming, the next step is to consider the prerequisites. These are things that will need to be in place prior to enabling SMP FT. As soon as these requirements are checked off, I'll actually walk through the process of configuring and enabling FT.
Synchronously logging all the commands a CPU executes from one machine to another is no easy task, so it's unsurprising that the list of prerequisites is a bit long. If a workload is critical enough, however, making it through this checklist is totally worth it. Before attempting to enable FT on a High Availability (HA) cluster, ensure the following prerequisites are met.
- FT is a function of an HA cluster. Before proceeding, ensure that at least three hosts are in an HA cluster. (Configuring this is out of the scope of this article.)
- Ensure that Enhanced vMotion Compatibility (EVC) is configured on the cluster.
- Ensure that hosts have Hardware Virtualization (Intel VT or AMD-V) enabled in the BIOS. (This can be checked by running esxcfg-info|grep "HV Support" from an SSH session. Check the output against this Knowledge Base article.)
- Ensure that VMs to be protected are on shared storage which is not a VSAN or VVOL datastore (e.g., a standard VMFS volume via FC or iSCSI, or an NFS export).
- Enabling FT sets a memory reservation for the protected VM. Verify HA Admission Control settings to ensure that setting a reservation won't throw slot size calculations out of whack.
- Ensure that SSL Certificate checking is enabled for the vCenter Server.
- In the case of SMP FT, ensure the ESXi hosts in the cluster are at vSphere 6.0 or greater.
- Ensure VMs to be protected have no snapshots.
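The checklist above can also be written down as a small pre-flight script. The following is only a sketch in plain Python: the `Host` and `Cluster` classes and their fields are hypothetical stand-ins for data you would gather yourself, not part of any VMware API, and the HA Admission Control item is left out because it depends on site-specific policy.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Datastore types the article lists as unsupported for FT.
UNSUPPORTED_DATASTORES = {"vsan", "vvol"}
MIN_HOSTS = 3
MIN_ESXI = (6, 0)  # needed for SMP FT


@dataclass
class Host:
    name: str
    esxi_version: Tuple[int, int]  # (major, minor)
    hv_enabled: bool               # Intel VT or AMD-V enabled in the BIOS


@dataclass
class Cluster:
    ha_enabled: bool
    evc_enabled: bool
    ssl_cert_checking: bool
    hosts: List[Host] = field(default_factory=list)


def unmet_prerequisites(cluster: Cluster, datastore_type: str,
                        snapshot_count: int) -> List[str]:
    """Return a human-readable list of prerequisites that are not met."""
    problems = []
    if not cluster.ha_enabled:
        problems.append("cluster is not an HA cluster")
    if len(cluster.hosts) < MIN_HOSTS:
        problems.append(f"cluster has fewer than {MIN_HOSTS} hosts")
    if not cluster.evc_enabled:
        problems.append("EVC is not configured")
    if not cluster.ssl_cert_checking:
        problems.append("SSL certificate checking is disabled")
    for host in cluster.hosts:
        if not host.hv_enabled:
            problems.append(f"{host.name}: hardware virtualization is disabled")
        if host.esxi_version < MIN_ESXI:
            problems.append(f"{host.name}: ESXi is older than 6.0")
    if datastore_type.lower() in UNSUPPORTED_DATASTORES:
        problems.append(f"datastore type '{datastore_type}' is not supported")
    if snapshot_count > 0:
        problems.append("VM has snapshots")
    return problems
```

An empty list means every modeled prerequisite passed; anything else is a to-do list to work through before enabling FT.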
To make all of this checking a bit easier, the Web Client includes a built-in compliance check. Highlight the cluster object, browse to the Monitor tab, and then select the Profile Compliance section. Click the "Check Compliance Now" button to check the hosts in the cluster against the requirements for a proper HA and DRS configuration. Because FT is a subset of HA, all the cluster level prerequisites will be checked. Figure 1 shows a successful compliance check where all of the hosts have valid configurations.
For demonstration's sake, I unconfigured the FT logging VMkernel port for one of my hosts and re-ran the compliance check. Figure 2 shows what an unsuccessful compliance check looks like.
Note that below the Host Compliance section is an explanation of compliance faults (if there are any). In this case, it shows that FT isn't supported on this host because FT logging isn't enabled.
Now, from the beginning, I'll walk through configuring hosts for FT, then through enabling and testing FT on a VM. The first step is to create an HA cluster. Distributed Resource Scheduling (DRS) is also supported with FT. (Again, configuring the cluster as a whole is outside the scope of this article.)
With three hosts in a cluster, networking must be configured for vMotion and FT logging. As a reminder, VMware specifies that these should be separate physical adapters and separate VMkernel interfaces. Figure 3 shows the VMkernel adapters for one of my three ESXi hosts. Interface vmk1 is configured on VLAN 101 for vMotion (and uses a specific uplink from my VDS uplinks). Interface vmk2 is configured on VLAN 102 for FT logging (and uses a different uplink). Figure 4 shows the configuration of the dvPortGroup for FT with regard to Teaming and Failover.
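Since vMotion and FT logging must sit on different subnets and separate physical adapters, it can be worth sanity-checking an addressing plan before touching the hosts. The sketch below uses Python's standard `ipaddress` module; the `VmkInterface` class, the IP addresses, and the uplink names are hypothetical and are not taken from my lab.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass
class VmkInterface:
    name: str     # e.g. "vmk1"
    address: str  # CIDR notation, e.g. "10.0.101.11/24" (hypothetical)
    uplink: str   # physical uplink carrying this traffic (hypothetical name)


def ft_network_problems(vmotion: VmkInterface,
                        ft_logging: VmkInterface) -> List[str]:
    """Flag a shared subnet or a shared uplink between the two interfaces."""
    problems = []
    vmotion_net = ipaddress.ip_interface(vmotion.address).network
    ft_net = ipaddress.ip_interface(ft_logging.address).network
    if vmotion_net.overlaps(ft_net):
        problems.append(f"{vmotion.name} and {ft_logging.name} share a subnet")
    if vmotion.uplink == ft_logging.uplink:
        problems.append(f"{vmotion.name} and {ft_logging.name} share an uplink")
    return problems


if __name__ == "__main__":
    vmk1 = VmkInterface("vmk1", "10.0.101.11/24", "uplink1")  # vMotion
    vmk2 = VmkInterface("vmk2", "10.0.102.11/24", "uplink2")  # FT logging
    print(ft_network_problems(vmk1, vmk2))
```

Using `overlaps` rather than comparing the network strings also catches the case where one interface sits in a smaller subnet carved out of the other's range.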
Once the hosts are configured, VMs can be protected. FT is enabled for a VM by right-clicking the VM, highlighting Fault Tolerance, and selecting Turn On. Figure 5 shows this option, as well as the options that will be available once FT is enabled (currently greyed out).
At this stage, vSphere performs a validation on the specific VM to ensure compatibility with FT. The following items are checked by this validation:
- SSL certificate checking must be enabled in the vCenter Server settings.
- The host must be in a vSphere HA cluster or a mixed vSphere HA and DRS cluster.
- The host must have ESXi 6.x or greater installed (ESX/ESXi 4.x or greater for legacy FT).
- The VM must not have snapshots.
- The VM must not be a template.
- The VM must not have vSphere HA disabled.
- The VM must not have a video device with 3D enabled.
- All virtual hardware must be compatible with FT.
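For illustration, the VM-level portion of this validation can be modeled as a table of rules. This is a sketch of the idea only, not how vSphere implements the check; the `VmConfig` class and its fields are hypothetical, and the hardware rules (RDMs, hot-plug, 2TB VMDK limit) are borrowed from the incompatibilities listed earlier as examples of what "compatible with FT" means.

```python
from dataclasses import dataclass
from typing import List

MAX_VMDK_GB = 2048  # VMDKs greater than 2TB are incompatible with FT


@dataclass
class VmConfig:
    """Hypothetical summary of the VM settings relevant to FT."""
    snapshot_count: int = 0
    is_template: bool = False
    ha_disabled: bool = False
    video_3d_enabled: bool = False
    hot_plug_enabled: bool = False
    uses_rdm: bool = False
    largest_vmdk_gb: float = 0.0


# Each rule pairs a predicate that detects a violation with its message.
RULES = [
    (lambda vm: vm.snapshot_count > 0, "VM has snapshots"),
    (lambda vm: vm.is_template, "VM is a template"),
    (lambda vm: vm.ha_disabled, "vSphere HA is disabled for the VM"),
    (lambda vm: vm.video_3d_enabled, "video device has 3D enabled"),
    (lambda vm: vm.hot_plug_enabled, "CPU/Memory Hot-plug is enabled"),
    (lambda vm: vm.uses_rdm, "VM uses an RDM"),
    (lambda vm: vm.largest_vmdk_gb > MAX_VMDK_GB, "a VMDK is larger than 2TB"),
]


def validation_errors(vm: VmConfig) -> List[str]:
    """Return the message of every rule the VM violates, in rule order."""
    return [message for violated, message in RULES if violated(vm)]
```

Keeping the rules in a list makes it easy to add or remove checks as VMware changes what FT supports between releases.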
In Figure 6, the validation check on the VM throws an error. This is because I'm attempting to enable FT on a VM residing on my VSAN datastore.
If everything checks out during this validation, FT can be enabled for the VM. Figure 7 shows that when there are no validation errors, the dialogue box offers Yes and No buttons to answer the question, "Turn on Fault Tolerance for this virtual machine?" Selecting Yes causes the wizard to proceed to the configuration settings for this particular FT VM.
First, the administrator is allowed to select which datastore the Secondary VM's files should be stored on. Next, the host for the Secondary VM to run on is selected. Finally, as seen in Figure 7, a confirmation screen is shown and the Finish button causes the provisioning of the Secondary VM to begin.
The Secondary VM is created, and then FT begins synchronizing the VMs. This took some time in my lab, because I'm running in a less-than-ideal configuration that uses 1GbE uplinks instead of the preferred 10GbE. Once the synchronization is complete, however, the Summary page for the VM will show a Fault Tolerance status of Protected. Figure 8 shows a successfully protected VM. It's interesting to note the "Log bandwidth usage" field. FT logging for this VM is consuming almost the full bandwidth of two of my 1GbE uplinks, and would have totally saturated a single one. And this is only one machine. This is why 10GbE networking is preferred.
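The arithmetic behind the 10GbE recommendation is simple enough to sketch. The 1800 Mbps figure below is a hypothetical value chosen to represent "almost two 1GbE uplinks"; it is not the number measured in my lab, and the function names are my own.

```python
import math


def link_utilization(log_mbps: float, link_gbps: float) -> float:
    """Fraction of a single uplink consumed by FT logging traffic."""
    return log_mbps / (link_gbps * 1000)


def links_needed(log_mbps: float, link_gbps: float) -> int:
    """Smallest number of uplinks of this speed that can carry the traffic."""
    return math.ceil(link_utilization(log_mbps, link_gbps))


if __name__ == "__main__":
    log_mbps = 1800  # hypothetical FT logging bandwidth for one VM
    for speed in (1, 10):
        share = link_utilization(log_mbps, speed)
        print(f"{speed}GbE: {share:.0%} of one link, "
              f"{links_needed(log_mbps, speed)} link(s) needed")
```

Under that assumption, one protected VM needs two 1GbE links but uses less than a fifth of a single 10GbE link, which leaves room for additional protected VMs.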
The final piece of the FT puzzle is ensuring that it will actually be functional in the event it's needed. This can be accomplished using the built-in Test Failover function. Be aware that this is a live failover, meaning that vSphere will actually induce a failure of the Primary VM, causing the Secondary to take over. If this process doesn't go as planned, you could have just right-clicked yourself into an outage.
Once the failover has been initiated, FT immediately starts creating a new Secondary VM to replace the one that just became the Primary. You'll see the FT information on the Summary tab showing that the Secondary VM is being built and that this machine isn't protected yet.
Despite the huge lists of design considerations and prerequisites to fulfill, FT with up to 4 vCPUs can be an immensely helpful tool in certain use cases. The main takeaways from this guide are: consider interoperability (as there are many incompatibilities with other vSphere features); be sure to have 10GbE networking for production; and use the built-in prerequisite checker to quickly turn up issues.