In-Depth
        
        VM Component Protection in vSphere 6
        This new feature makes outages and data loss much less  likely.
        
        
        
Shared storage is the backbone of highly available server virtualization. With it, workloads can be moved freely between hosts, and maintenance is a snap. Without it, an environment is substantially less flexible. In a vSphere environment, though, a shared storage target can sometimes become unavailable suddenly and unintentionally.
In previous versions of vSphere (5.0 and earlier), this situation wasn't handled as gracefully as one would hope. vSphere 5.1 improved how the loss was handled to make it less catastrophic, and 5.5 improved on that further. But in vSphere 6, VMware introduced a long-anticipated feature that protects against the extended outages that plagued previous versions. This feature is called VM Component Protection, or VMCP.
What is it that VMCP is protecting against, and what was so bad before the feature was introduced? There are actually two potential failures: a PDL condition and an APD condition. These affect ESXi and the virtual machines (VMs) in different ways.
Permanent Device Loss
  The lesser of two evils, a Permanent Device Loss (PDL) condition occurs when a storage target is unexpectedly removed from a host, but the host is told about it. The storage array sends what's called a SCSI sense code (here's a list of the codes) to the host telling it that the LUN has failed, and that it can assume that it's no longer accessible.
This condition is problematic, to be sure, but it's less of a  problem than APD because it's more definite. Knowing the determinate status of  the situation allows a host to act accordingly. In the case of a PDL condition,  the host stops issuing I/O to the target because it knows it's inaccessible. 
All-Paths-Down
  All-Paths-Down (APD), on the other hand, is a very undesirable condition. This occurs when a storage target becomes inaccessible to a host, with no notification and no ability to contact the storage array. The precarious situation this leaves the host in is that it doesn't know whether the device loss is permanent (due to the LUN being failed/destroyed, zoning changes and so on) or whether it's temporary (due to a momentary network outage, a configuration error that will take 10 seconds to revert and so on). Improperly removing storage that's being decommissioned can inadvertently cause an APD, so be sure to reference the VMware Knowledge Base article on how to properly remove storage.
Because there's potential for the device to become  accessible again, ESXi continues retrying I/O operations. This is in contrast  to PDL, where it gives up right away. Due to the continued unsuccessful I/O,  especially from userworld processes like hostd,  the ESXi host can eventually become unresponsive and unmanageable. 
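The difference between the two conditions can be sketched as a small decision function. This is only a toy model of the behavior described above, not ESXi code; the names StorageState, classify and keeps_retrying_failed_io are made up for illustration.

```python
from enum import Enum


class StorageState(Enum):
    OK = "ok"
    PDL = "permanent device loss"
    APD = "all paths down"


def classify(paths_up: int, pdl_sense_code_received: bool) -> StorageState:
    """Classify a storage target from the host's point of view (toy model)."""
    # The array explicitly told the host the device is gone: PDL.
    if pdl_sense_code_received:
        return StorageState.PDL
    # No working path and no word from the array: APD.
    if paths_up == 0:
        return StorageState.APD
    return StorageState.OK


def keeps_retrying_failed_io(state: StorageState) -> bool:
    """With PDL the host gives up right away; with APD it keeps retrying."""
    return state is StorageState.APD


if __name__ == "__main__":
    for paths, sense in [(2, False), (0, True), (0, False)]:
        state = classify(paths, sense)
        print(paths, sense, state.name, keeps_retrying_failed_io(state))
```

The point of the sketch is that the only thing separating the two cases is whether the host received a definite answer from the array.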
There's generally no resolution but to reboot the host.  Because of this high potential for outages, properly handling APD conditions so  this doesn't happen has been on VMware's to-do list, and with vSphere 6 they  finally did it. 
VM Component Protection
  VM Component Protection (VMCP) is a marketing-friendly way  of saying that you can now configure the response to PDL and APD conditions as  it relates to VMs directly from the High Availability (HA) configuration  screen. 
When configuring the HA settings on a cluster object, there's a new section called "Host Hardware Monitoring – VM Component Protection" (Figure 1). Under this heading, the user has the option to configure unique responses to both PDL and APD conditions, as well as configure the timer for APD, which ensures that a temporary network blip doesn't initiate a massive HA failover. Do note that as with all new features, you can only configure this from the Web client.
	
    
    
	
Figure 1. The VM Component Protection screen.
	
For PDL responses, the following actions can be taken:
  - Disabled.  No action is taken.
 
  - Issue  events. No action is taken, but an alert is shown.
 
  - Power off  and restart VMs. Affected VMs will be failed over by HA to a host that has  connectivity to the respective datastore.
 
For APD events, the following actions can be taken:
  - Disabled.  No action is taken.
 
  - Issue  events. No action is taken, but an alert is shown.
 
  - Power off and restart VMs (conservative). HA slave nodes will communicate with the master in an attempt to find a host where the VMs could be powered on and run successfully. Only when a healthy host is identified will an HA failover take place.
 
  - Power off  and restart VMs (aggressive). HA slave nodes will attempt to communicate  with the HA master to find a suitable location to fail over VMs to. If  communication with the master node isn't possible, HA will attempt the failover  anyway. This carries the risk of not being able to power VMs back on, but is  desirable in a network partition condition where a suitable host does exist but  the HA slave can't communicate with the master.
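The four APD responses can be summarized as a short decision sketch. It is a simplified model of the descriptions in the list, not VMware's implementation; the ApdResponse enum, the apd_action function and its return strings are hypothetical. The list doesn't say what the aggressive setting does when the master is reachable but reports no healthy host, so the sketch assumes no failover in that case.

```python
from enum import Enum


class ApdResponse(Enum):
    DISABLED = "Disabled"
    ISSUE_EVENTS = "Issue events"
    RESTART_CONSERVATIVE = "Power off and restart VMs (conservative)"
    RESTART_AGGRESSIVE = "Power off and restart VMs (aggressive)"


def apd_action(response: ApdResponse,
               master_reachable: bool,
               healthy_host_found: bool) -> str:
    """Return what HA would do for an APD event under each setting (toy model).

    healthy_host_found is only meaningful when the master is reachable.
    """
    if response is ApdResponse.DISABLED:
        return "no_action"
    if response is ApdResponse.ISSUE_EVENTS:
        return "alert_only"
    # Both restart settings fail over once the master confirms a healthy host.
    if master_reachable and healthy_host_found:
        return "fail_over"
    # Aggressive: if the master can't be reached, attempt the failover anyway.
    if response is ApdResponse.RESTART_AGGRESSIVE and not master_reachable:
        return "fail_over"
    # Conservative without a confirmed host, or the assumed aggressive case.
    return "no_failover"


if __name__ == "__main__":
    for resp in ApdResponse:
        print(resp.value, "->", apd_action(resp, master_reachable=False,
                                           healthy_host_found=False))
```

Running it with the master unreachable shows the one case where the two restart settings differ: only the aggressive one proceeds.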
 
The delay before HA acts on an APD condition can be configured, along with the response if the datastore recovers after that delay has elapsed. That response can be set to either Disabled or Reset VMs; the latter causes HA to hard reset the affected VMs, but on the same host they were already running on.
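How the timer keeps a short blip from triggering a failover can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than a description of HA internals: the function name timer_expiry_time, the sampled (time, reachable) input and the 180-second delay are all chosen for illustration and don't come from VMware.

```python
from typing import Iterable, Optional, Tuple


def timer_expiry_time(samples: Iterable[Tuple[float, bool]],
                      delay_s: float) -> Optional[float]:
    """Return the first sample time at which the datastore has been
    unreachable for at least delay_s seconds, or None if that never happens.

    samples is a time-ordered sequence of (timestamp_s, reachable) pairs.
    """
    down_since: Optional[float] = None
    for t, reachable in samples:
        if reachable:
            # Storage came back: the outage was a blip, so reset the timer.
            down_since = None
            continue
        if down_since is None:
            down_since = t
        if t - down_since >= delay_s:
            return t
    return None


if __name__ == "__main__":
    blip = [(0, False), (5, False), (10, True)]
    outage = [(0, True), (10, False), (20, True),
              (30, False), (100, False), (210, False)]
    print(timer_expiry_time(blip, delay_s=180))
    print(timer_expiry_time(outage, delay_s=180))
```

In the first timeline the storage returns before the delay runs out, so nothing is triggered; in the second, the timer restarts after the brief recovery and only expires once the outage has lasted the full delay.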
By leveraging VMCP, which is not turned on by default and  must be configured, a vSphere cluster can provide higher levels of availability  to applications than ever before. 
        
        
        
        
        
        
        
        
        
        
        
        
            
        
        
                
                    About the Author
                    
                
                    
                    vExpert James Green has roughly a decade of experience as an IT administrator, architect and consultant in a variety of organizations. He's highly certified, and continues to pursue professional certifications to increase his breadth and depth of knowledge. He has always been passionate about writing and speaking, and discussing the marriage of cutting-edge technology and business is one of his favorite activities. He works for ActualTech Media, www.actualtech.io.