In-Depth
        
        Using SMB Direct in the Real World
As storage changes, Microsoft is keeping pace with its latest offerings. See how fast it can be in this hands-on test.
        
        
        
  When Microsoft decided to create its own software-defined  storage (SDS) stack, first released in Windows Server 2012, it had a good look  at the existing offerings in the market. The staple of enterprise storage for  many years has been SANs, Fibre Channel (FC) if you could afford it, and iSCSI  if you couldn't. Simply decide on a vendor, capacity, which extra features you  want to license, and write a (big) check. 
  Storage is changing, however, and Microsoft's  ownership of the entire storage stack through SDS, and building on top of  simple disk enclosures, opens up some interesting alternatives for storage that  you may not have considered. 
  Microsoft SDS in a Nutshell
  To fully appreciate the importance of SMB  Direct and RDMA networking, we first need to look at the components of  Microsoft's SDS and what it can (and can't) be used for. 
  A Scale-Out File  Server (SOFS) is composed of clustered Windows Server file servers. They deliver  storage as file shares to workload servers. Should one of these servers fail,  the others take over very quickly (much quicker than waiting for TCP/IP  timeouts) in a process called Transparent Failover, ensuring no interruption in  workload IO. Common SOFS cluster sizes are two, three or four nodes. 
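As a minimal sketch of what that looks like to set up (the server names and IP address below are placeholders, not a recommendation), building an SOFS comes down to creating the cluster and adding the role:

```powershell
# Assumption: FS01 and FS02 are domain-joined file servers with the
# File Server and Failover Clustering features already installed.
New-Cluster -Name SOFS-CL -Node FS01, FS02 -StaticAddress 192.168.1.50

# Add the Scale-Out File Server role; clients connect to this name,
# and Transparent Failover keeps IO flowing if a node goes down.
Add-ClusterScaleOutFileServerRole -Name SOFS -Cluster SOFS-CL
```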
  The actual storage is  provided by either a SAN (FC or iSCSI) or by Storage Spaces. In the case of the  former, the benefit is that you need less wiring. If you have a 16-node Hyper-V  cluster connected directly to a FC SAN, you'll need 32 Host Bus Adapters (HBAs):  two each for redundancy, along with cables and switches. If you instead place a  three-node SOFS cluster in front of the SAN, you'll only need six adapters and  a smaller switch. 
  If Storage Spaces is used,  you'll have one or more Just a Bunch of Disks (JBOD) enclosures connected to  each SOFS node through external SAS cables. These disk arrays have very little  smarts built in and no RAID backplanes; just slots for HDD and SSD disks.  You'll hear Microsoft engineers speaking of 2x2 configurations (or 3x3/4x4) which  means two cluster nodes connected to two disk arrays. The enclosures communicate  their state to Windows, so if you have at least three enclosures you get  resiliency to a whole enclosure failing. Microsoft has an excellent guide and accompanying Excel worksheet for planning  an SOFS cluster.
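To confirm that Windows actually sees the enclosures and the disks in them, a quick check along these lines (a sketch using the in-box Storage cmdlets) is useful:

```powershell
# List the JBOD enclosures Windows has discovered
Get-StorageEnclosure | Select-Object FriendlyName, NumberOfSlots, HealthStatus

# List the physical disks that are available to be pooled
Get-PhysicalDisk -CanPool $true |
    Select-Object FriendlyName, MediaType, Size, EnclosureNumber
```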
  Tiering Up
Physical disks are pooled together, and out of each pool you carve out one or more virtual disks (the equivalent of LUNs in SANs) with resiliency characteristics such as two-way or three-way mirroring or parity. Performance is ensured through storage tiering, where Storage Spaces creates two performance tiers: one for HDD and one for SSD. Frequently accessed blocks are moved to the SSD tier, while colder data is moved to the HDD tier.
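As a rough sketch of that flow (the pool, tier and disk names are placeholders from my lab, and the tier sizes are just examples), the PowerShell looks something like this:

```powershell
# Pool all the poolable disks presented by the JBODs
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName Pool1 `
    -StorageSubSystemFriendlyName "*Storage Spaces*" -PhysicalDisks $disks

# Define the two performance tiers
$ssd = New-StorageTier -StoragePoolFriendlyName Pool1 -FriendlyName SSDTier -MediaType SSD
$hdd = New-StorageTier -StoragePoolFriendlyName Pool1 -FriendlyName HDDTier -MediaType HDD

# Carve out a two-way mirrored, tiered virtual disk (the LUN equivalent)
New-VirtualDisk -StoragePoolFriendlyName Pool1 -FriendlyName VDisk1 `
    -ResiliencySettingName Mirror -NumberOfDataCopies 2 `
    -StorageTiers $ssd, $hdd -StorageTierSizes 100GB, 2TB
```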
The workloads can be of two types: Hyper-V clusters, where the virtual disks of all the virtual machines (VMs) are stored on the SOFS shares; and SQL Server, where the databases are accessed through file shares. Connecting these workload hosts to the SOFS shares is SMB 3.02, the venerable file sharing protocol that's been around for many versions of Windows and received a thorough overhaul in Windows Server 2012.
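The shares themselves are ordinary SMB shares created on the SOFS (continuous availability is on by default for clustered shares); here's a hedged example, with invented paths and computer accounts:

```powershell
# Create a share for Hyper-V virtual disks; the Hyper-V hosts' computer
# accounts need Full Control on both the share and the NTFS permissions.
New-SmbShare -Name VMStore -Path C:\ClusterStorage\Volume1\Shares\VMStore `
    -FullAccess "CONTOSO\HV01$", "CONTOSO\HV02$", "CONTOSO\Domain Admins"

# Verify that the share is continuously available
Get-SmbShare -Name VMStore | Select-Object Name, ContinuouslyAvailable
```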
The solution described here matches SANs feature for feature, with one exception: the connection from the Hyper-V or SQL Server nodes to the shared storage. To really match high-end SAN performance, 1Gbps or even 10Gbps Ethernet networking doesn't cut it. Enter SMB Direct, built on Remote Direct Memory Access (RDMA) hardware. In Windows Server 2012, SMB Direct is used only to connect the workloads to the storage; in 2012 R2, Hyper-V hosts can also use SMB Direct to Live Migrate VMs from one host to another.
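To let Live Migration take advantage of SMB Direct on 2012 R2, the hosts have to be told to use SMB as the migration transport; a quick sketch:

```powershell
# Make sure Live Migration is enabled, then switch the transport to SMB
# so migrations can use SMB Direct (RDMA) when capable NICs are present
Enable-VMMigration
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB

# Confirm the setting
Get-VMHost | Select-Object VirtualMachineMigrationPerformanceOption
```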
RDMA is a technology that first saw life in the High Performance Computing (HPC) world, which uses large numbers of nodes to collectively work on large datasets of financial or scientific data. High-speed, low-CPU-load, low-latency interconnects between nodes are essential for HPC, hence the development of RDMA. Today there are three flavors of RDMA:
- Mellanox's InfiniBand goes up to the recently released 100Gbps, but requires a separate infrastructure (similar to how you have to set up FC host bus adapters, cabling and switches), which negates some of the benefits of converging the infrastructure. Some of the VMs you can run in Azure IaaS come with InfiniBand networking.
- Mellanox also offers RDMA over Converged Ethernet (RoCE). This provides a single, Ethernet-based infrastructure, similar to how you can use a Fibre Channel over Ethernet (FCoE) solution to run FC traffic over Ethernet networks.
- Internet Wide Area RDMA Protocol (iWARP) from Chelsio, which has been around since 2007. The current top speed here is 40Gbps, but it's noteworthy that Microsoft chose Chelsio RDMA networking for its Cloud Platform System (CPS). CPS is a turnkey private cloud implementation, delivered in one to four racks.
For an excellent  comparison of the different flavors of RDMA, see this article. 
  The Future Is S2D
The SDS solution described so far has two limitations: it doesn't scale well to larger environments, and it's hard to justify in very small ones. Each JBOD tray tops out at around 80 disks (HDD + SSD), and the maximum number of disks in each pool is 84. The HDDs and SSDs all have to be SAS (which means higher cost), and there are restrictions on expanding storage easily; you can't just plug in a few more drives when you need them. Finally, for very small environments, the complexity and number of nodes needed for separate SOFS and Hyper-V hosts preclude deployments in small businesses and branch offices.
The potential solution to this problem is coming in Windows Server 2016. The current Technical Preview 2 (TP2) offers support for a new technology called Storage Spaces Direct (S2D). Instead of external JBOD arrays, the storage internal to each storage host is used (this can be externally SAS-connected storage, but it only needs to be hooked up to that host, not to all of them).
SATA drives are now supported, along with NVMe (essentially flash memory on PCI Express cards), bringing costs down. S2D will also support (though not in TP2) hyper-converged deployments, where the storage hosts are also the Hyper-V hosts.
  RDMA will be a  requirement for connecting the S2D hosts together so that storage traffic can  be rapidly distributed. It will support the same two- and three-way mirroring  configurations, along with parity data resiliency. 
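In TP2 the feature is switched on per cluster with a single cmdlet; here's a minimal sketch with placeholder node names, bearing in mind that the cmdlets may well change before release:

```powershell
# Build the cluster from the S2D nodes, each contributing only its local disks
New-Cluster -Name S2D-CL -Node S2D01, S2D02, S2D03, S2D04 -NoStorage

# Enable Storage Spaces Direct across the nodes' local drives
Enable-ClusterStorageSpacesDirect
```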
Putting RDMA to the Test
  Chelsio gave me two  NICs with two 40Gbps ports (T580-CR) and two NICs with two 10Gbps ports (T520-LL), which I installed in my test lab. The lab  has two Windows Server 2012 R2 Hyper-V hosts and two Windows Server 2016 TP2  hosts; each host has 32GB RAM. Storage is provided by a Windows Server 2012 R2  host with three HDD and two SSD drives, using storage tiering and two-way  mirroring. 
  I started by  installing the two 40Gbps cards in the Windows Server 2012 R2 Hyper-V hosts.  This gives an effective bandwidth of 80Gbps between them. Driver installation  was easy, with a single driver download that supports all Chelsio network  adapters. Windows automatically discovers NICs that support RDMA and turns on  SMB Direct (you can disable it if desired with Disable-NetAdapterRdma in  PowerShell).
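To confirm what Windows discovered, and that SMB is actually using RDMA once traffic flows, I used the in-box cmdlets below; think of it as a minimal sanity check rather than a full health report:

```powershell
# Is the NIC RDMA-capable and is RDMA enabled on it?
Get-NetAdapterRdma | Select-Object Name, Enabled

# Does the SMB client see the interface as RDMA-capable?
Get-SmbClientNetworkInterface | Select-Object InterfaceIndex, RdmaCapable, LinkSpeed

# After generating some traffic, check that connections are using SMB Direct
Get-SmbMultichannelConnection |
    Select-Object ServerName, ClientRdmaCapable, ServerRdmaCapable
```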
  
  
	
Figure 1. Performance Monitor checks on counters during the Live Migrations.
I created a Windows Server 2012 R2 VM and allocated it 24GB of static memory. In this VM I ran HeavyLoad to make sure all the memory was utilized before Live Migrating (LM) the VM from host 1 to host 2. On average, these LMs took between 8 and 10 seconds.
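Timing the moves is easy to reproduce; the migrations I describe were effectively this, with the VM and destination host names as placeholders from my lab:

```powershell
# Time a live migration of the memory-loaded VM to the second host
Measure-Command {
    Move-VM -Name TestVM -DestinationHost HV02
} | Select-Object TotalSeconds
```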
  Now think about your  own cluster, and a typical patch Tuesday: you have to move all VMs from one host  to other hosts before performing the required maintenance, then move them back.  Then you have to repeat the process on the next host, and on it goes. Speeding  up this process could be very valuable in the real world. Note that my CPU  utilization during these moves stayed consistently under 10 percent, due to the  Chelsio NICs offloading the data transfer altogether from the hosts. 
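On a clustered host, that evacuation is usually a node drain rather than a series of individual moves; the pattern looks roughly like this (the node name is a placeholder):

```powershell
# Drain all roles (VMs) off the node before patching...
Suspend-ClusterNode -Name HV01 -Drain -Wait

# ...patch and reboot, then bring the node back and pull its VMs home
Resume-ClusterNode -Name HV01 -Failback Immediate
```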
	
Figure 2. In this test, HeavyLoad is consuming memory in the VM before the Live Migration.
  
  Testing, Part 2
My second test involved moving one of the 40Gbps cards to my storage host, effectively connecting one host to the storage over an 80Gbps link. I created another VM with Windows Server 2012 R2 and allocated it a virtual data disk, which I pinned to the SSD tier. Pinning means manually configuring a whole file to live on a particular tier, which is useful for VDI scenarios in which, for instance, the gold master image file is accessed by many clients simultaneously. For this test, it ensured that I got the maximum performance from my two (desktop-class) SSDs.
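Pinning is done with the file storage tier cmdlets followed by a tier optimization pass; a rough sketch, with the tier name, drive letter and file path as placeholders from my lab:

```powershell
# Find the SSD tier of the tiered virtual disk backing the D: volume
$ssdTier = Get-StorageTier -FriendlyName VDisk1_SSDTier

# Pin the VM's data disk to that tier
Set-FileStorageTier -FilePath D:\VMs\DataDisk.vhdx -DesiredStorageTier $ssdTier

# Run the tier optimization so the file is moved immediately
Optimize-Volume -DriveLetter D -TierOptimize
```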
I ran DiskSpd inside the VM to test IO speeds at different IO sizes. You might have seen presenters use SQLIO for demonstrating storage performance, and while it's still available, DiskSpd is the preferred tool now. My results varied from 300 MB/s for small IO sizes to just over 530 MB/s for large IO sizes, with latency between 1.6 and 15 milliseconds.
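For reference, my DiskSpd runs were variations on a command line like the one below (the test file path and size are placeholders): -b sets the IO size, -d the duration, -o and -t the outstanding IOs and threads, -r random access, -w the write percentage and -L captures latency.

```powershell
# 8K random IO, 70/30 read/write mix, 8 outstanding IOs on 4 threads,
# 60-second run against a 20GB test file, with latency statistics
.\diskspd.exe -b8K -d60 -o8 -t4 -r -w30 -L -c20G D:\testfile.dat
```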
	
Figure 3. The IO results from DiskSpd speak volumes; half a GB per second isn't bad for two desktop-class SSD drives.
The CPU utilization on the host during these tests hovered around 5 percent. The equivalent tests using the 10Gbps cards, set to non-RDMA mode, resulted in more than 80 percent CPU utilization. With RDMA mode enabled, the 10Gbps cards achieved very similar results, simply because my two SSDs can't provide IO fast enough to saturate these network speeds.
To match the 80Gbps I would have needed something like 160 SSD drives. One tip: If you're doing storage testing on your own Storage Spaces setup, don't use file copy; the results won't be representative of the kind of performance you're going to get with Hyper-V and SQL workloads.
 
  We Have a Winner
  These Chelsio NICs  have been a joy to work with. The technical support when I needed clarification  of configuration settings was swift and accurate, and the NICs are easy to  install and configure. They're also cost effective, so if you're considering a  storage spaces or storage spaces direct implementation, you owe it to yourself  to put Chelsio RDMA technology on your short list.