Using SMB Direct in the Real World
As storage changes, Microsoft's latest offerings have kept pace. See how fast they can be in this hands-on test.
When Microsoft decided to create its own software-defined storage (SDS) stack, first released in Windows Server 2012, it took a good look at the existing offerings in the market. The staple of enterprise storage for many years has been the SAN: Fibre Channel (FC) if you could afford it, iSCSI if you couldn't. Simply decide on a vendor, a capacity and which extra features you want to license, then write a (big) check.
Storage is changing, however. Microsoft's ownership of the entire storage stack through SDS, built on top of simple disk enclosures, opens up some interesting alternatives for storage that you may not have considered.
Microsoft SDS in a Nutshell
To fully appreciate the importance of SMB Direct and RDMA networking, we first need to look at the components of Microsoft's SDS and what it can (and can't) be used for.
A Scale-Out File Server (SOFS) is composed of clustered Windows Server file servers. They deliver storage as file shares to workload servers. Should one of these servers fail, the others take over very quickly (much quicker than waiting for TCP/IP timeouts) in a process called Transparent Failover, ensuring no interruption in workload IO. Common SOFS cluster sizes are two, three or four nodes.
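The moving parts are all driven through PowerShell. As a rough sketch (the role name, share path and host accounts below are placeholders, not taken from a real deployment), standing up the SOFS role and a continuously available share looks something like this:

```powershell
# On one node of an existing file server failover cluster,
# add the Scale-Out File Server role.
Add-ClusterScaleOutFileServerRole -Name "SOFS01"

# Publish a share on a Cluster Shared Volume. Continuous availability is
# what enables Transparent Failover, so workload IO survives a node failure.
New-SmbShare -Name "VMStore" -Path "C:\ClusterStorage\Volume1\Shares\VMStore" `
    -ScopeName "SOFS01" -ContinuouslyAvailable $true `
    -FullAccess 'CONTOSO\HV01$', 'CONTOSO\HV02$'
```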
The actual storage is provided by either a SAN (FC or iSCSI) or by Storage Spaces. In the case of the former, the benefit is that you need less wiring. If you have a 16-node Hyper-V cluster connected directly to a FC SAN, you'll need 32 Host Bus Adapters (HBAs): two each for redundancy, along with cables and switches. If you instead place a three-node SOFS cluster in front of the SAN, you'll only need six adapters and a smaller switch.
If Storage Spaces is used, you'll have one or more Just a Bunch of Disks (JBOD) enclosures connected to each SOFS node through external SAS cables. These disk arrays have very little smarts built in and no RAID backplanes; just slots for HDD and SSD disks. You'll hear Microsoft engineers speaking of 2x2 configurations (or 3x3/4x4) which means two cluster nodes connected to two disk arrays. The enclosures communicate their state to Windows, so if you have at least three enclosures you get resiliency to a whole enclosure failing. Microsoft has an excellent guide and accompanying Excel worksheet for planning an SOFS cluster.
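If you want to see what this looks like in practice, here's a minimal sketch of pooling the JBOD disks with PowerShell; the pool name is arbitrary and the subsystem wildcard simply matches the default Storage Spaces subsystem name:

```powershell
# List the disks presented by the JBOD(s) that are eligible for pooling.
Get-PhysicalDisk -CanPool $true |
    Format-Table FriendlyName, MediaType, Size -AutoSize

# Gather them into a single pool.
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName "Pool01" `
    -StorageSubSystemFriendlyName "*Storage Spaces*" `
    -PhysicalDisks $disks
```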
Tiering Up
Physical disks are pooled together, and out of each pool you carve one or more virtual disks (the equivalent of LUNs in a SAN) with resiliency characteristics such as two-way or three-way mirroring, or parity. Performance is ensured through storage tiering, where Storage Spaces creates two performance tiers: one for HDDs and one for SSDs. Frequently accessed blocks are moved to the SSD tier, while colder data is moved to the HDD tier.
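As a sketch of how those pieces go together (the pool, tier and virtual disk names are placeholders, and the tier sizes are just examples), creating a tiered, two-way mirrored virtual disk looks roughly like this:

```powershell
# Define an SSD and an HDD tier in the pool.
$ssdTier = New-StorageTier -StoragePoolFriendlyName "Pool01" `
    -FriendlyName "SSD_Tier" -MediaType SSD
$hddTier = New-StorageTier -StoragePoolFriendlyName "Pool01" `
    -FriendlyName "HDD_Tier" -MediaType HDD

# Carve a two-way mirrored, tiered virtual disk out of the pool.
New-VirtualDisk -StoragePoolFriendlyName "Pool01" -FriendlyName "VDisk01" `
    -ResiliencySettingName Mirror -NumberOfDataCopies 2 `
    -StorageTiers $ssdTier, $hddTier -StorageTierSizes 100GB, 2TB
```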
The workloads can be of two types: Hyper-V clusters, where the virtual disks of all the virtual machines (VMs) are stored on the SOFS shares; and SQL Server, where the databases are accessed through file shares. Connecting these workload hosts to the SOFS shares is SMB 3.02, the venerable file sharing protocol that's been around for many versions of Windows and received a thorough overhaul in Windows Server 2012.
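From a workload host you can confirm which dialect was negotiated with the SOFS and which interfaces SMB Multichannel is using; for example (the output will obviously vary with your environment):

```powershell
# Dialect should show 3.02 against a Windows Server 2012 R2 SOFS.
Get-SmbConnection | Format-Table ServerName, ShareName, Dialect -AutoSize

# The interfaces SMB Multichannel has picked for each server.
Get-SmbMultichannelConnection
```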
The solution described here matches SANs feature for feature, with one exception: the connection from the Hyper-V or SQL Server nodes to the shared storage. To match high-end SAN performance, 1Gbps or even 10Gbps Ethernet networking doesn't really cut it. Enter SMB Direct, built on Remote Direct Memory Access (RDMA) hardware. In Windows Server 2012, SMB Direct is used only to connect the workloads to the storage; in 2012 R2, Hyper-V hosts can also use SMB Direct to Live Migrate VMs from one host to another.
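Switching Live Migration over to the SMB transport is a per-host setting; a minimal sketch, run on each 2012 R2 Hyper-V host, might look like this:

```powershell
# Allow Live Migrations and tell Hyper-V to use SMB as the transport,
# which lets SMB Direct and SMB Multichannel carry the traffic.
Enable-VMMigration
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB

# Confirm the settings took effect.
Get-VMHost | Format-List VirtualMachineMigrationEnabled, VirtualMachineMigrationPerformanceOption
```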
RDMA is a technology that first saw life in the High Performance Computing (HPC) world, which uses large numbers of nodes to collectively work on large financial or scientific datasets. High-speed, low-CPU-load, low-latency interconnects between nodes are essential for HPC, hence the development of RDMA. Today there are three flavors of RDMA:
- Mellanox's InfiniBand goes up to the recently released 100Gbps, but requires a separate infrastructure (similar to how you have to set up FC host bus adapters, cabling and switches), which negates some of the benefits of converging the infrastructure. Some of the VMs you can run in Azure IaaS come with InfiniBand networking.
- Mellanox also offers RDMA over Converged Ethernet (RoCE). This provides a single, Ethernet-based infrastructure, similar to how you can use a Fibre Channel over Ethernet (FCoE) solution to run FC traffic over Ethernet networks.
- Internet Wide Area RDMA Protocol (iWARP) from Chelsio has been around since 2007. The current top speed is 40Gbps, and it's noteworthy that Microsoft chose Chelsio RDMA networking for its Cloud Platform System (CPS), a turnkey private cloud implementation delivered in one to four racks.
For an excellent comparison of the different flavors of RDMA, see this article.
The Future Is S2D
The SDS solution described so far has two limitations: it doesn't scale well to larger environments, and it doesn't scale down well to small ones. Each JBOD tray tops out at around 80 disks (HDD plus SSD), and the maximum number of disks in each pool is 84. The HDDs and SSDs all have to be SAS (which means higher cost), and expanding storage isn't straightforward; technical restrictions mean you can't just plug in a few more drives when you need them. Finally, at the other end of the scale, the complexity and number of nodes needed for separate SOFS and Hyper-V clusters preclude deployments in small and midsize businesses and branch offices.
The potential solution to both problems is coming in Windows Server 2016. The current Technical Preview 2 (TP2) includes support for a new technology called Storage Spaces Direct (S2D). Instead of relying on external JBOD arrays, S2D uses the storage internal to each storage host (this can still be SAS-connected external storage, but it only needs to be hooked up to that host, not to all of the hosts).
SATA drives are now supported, along with NVMe (essentially flash memory on PCI Express cards), bringing costs down. S2D will also support hyper-converged deployments (though not in TP2), where the storage hosts double as Hyper-V hosts.
RDMA will be a requirement for connecting the S2D hosts together so that storage traffic can be rapidly distributed. It will support the same two- and three-way mirroring configurations, along with parity data resiliency.
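For context, enabling S2D is again a PowerShell job. The sketch below uses the cmdlet names from later Windows Server 2016 builds (names shifted between preview builds, so check the documentation for your build), and the volume name and size are placeholders:

```powershell
# Claim and pool the local disks in every node of an existing cluster.
Enable-ClusterStorageSpacesDirect

# Create a mirrored, cluster-shared volume from the automatically created pool.
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "Volume01" `
    -FileSystem CSVFS_ReFS -Size 1TB
```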
Putting RDMA to the Test
Chelsio gave me two NICs with two 40Gbps ports (T580-CR) and two NICs with two 10Gbps ports (T520-LL), which I installed in my test lab. The lab has two Windows Server 2012 R2 Hyper-V hosts and two Windows Server 2016 TP2 hosts; each host has 32GB RAM. Storage is provided by a Windows Server 2012 R2 host with three HDD and two SSD drives, using storage tiering and two-way mirroring.
I started by installing the two 40Gbps cards in the Windows Server 2012 R2 Hyper-V hosts. This gives an effective bandwidth of 80Gbps between them. Driver installation was easy, with a single driver download that supports all Chelsio network adapters. Windows automatically discovers NICs that support RDMA and turns on SMB Direct (you can disable it if desired with Disable-NetAdapterRdma in PowerShell).
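A few cmdlets are handy for checking what Windows has detected and for toggling SMB Direct per adapter during comparison runs; the adapter name below is a placeholder for whatever your Chelsio ports are called:

```powershell
# Confirm the ports are recognized as RDMA-capable and have RDMA enabled.
Get-NetAdapterRdma | Format-Table Name, Enabled -AutoSize

# SMB's view of the same interfaces (look for RDMA Capable = True).
Get-SmbClientNetworkInterface

# SMB Direct can be switched off (and back on) per adapter.
Disable-NetAdapterRdma -Name "Chelsio Port 1"
Enable-NetAdapterRdma -Name "Chelsio Port 1"
```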
I created a Windows Server 2012 R2 VM and allocated it 24GB of static memory. In this VM I ran HeavyLoad to make sure all the memory was in use before Live Migrating (LM) the VM from host 1 to host 2. On average, these LMs took between 8 and 10 seconds.
Now think about your own cluster and a typical patch Tuesday: you have to move all the VMs from one host to the other hosts before performing the required maintenance, then move them back. Then you repeat the process on the next host, and on it goes. Speeding up this process could be very valuable in the real world. Note that my CPU utilization during these moves stayed consistently under 10 percent, because the Chelsio NICs offload the data transfer from the host CPUs entirely.
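One easy way to time such a move is to wrap it in Measure-Command; the VM and host names below are placeholders:

```powershell
# Time a single Live Migration of the memory-loaded VM to the second host.
Measure-Command {
    Move-VM -Name "LoadTestVM" -DestinationHost "HV02"
} | Select-Object TotalSeconds
```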
Testing, Part 2
My second test involved moving one of the 40Gbps cards to my storage host, effectively connecting one Hyper-V host to the storage over an 80Gbps link. I created another VM with Windows Server 2012 R2 and allocated it a virtual data disk, which I pinned to the SSD tier. Pinning means manually configuring a whole file to live on a particular tier, which is useful for VDI scenarios in which, for instance, the gold master image file is accessed by many clients simultaneously. For this test, it ensured that I got the maximum performance from my two (desktop-class) SSDs.
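Pinning is done per file with PowerShell; a minimal sketch (the VHDX path, drive letter and tier name are placeholders) looks like this:

```powershell
# Pin the data VHDX to the SSD tier of the tiered volume it lives on.
Set-FileStorageTier -FilePath "E:\VMs\TestVM\DataDisk.vhdx" `
    -DesiredStorageTierFriendlyName "SSD_Tier"

# Placement happens when the tier optimization job runs, so kick it off now.
Optimize-Volume -DriveLetter E -TierOptimize
```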
I ran DiskSpd inside the VM to test IO speeds at different IO sizes. You might have seen presenters use SQLIO to demonstrate storage performance, and while it's still available, DiskSpd is the preferred tool now. My results varied from 300 MB/s for small IO sizes to just over 530 MB/s for large IO sizes, with latency between 1.6 and 15 milliseconds.
The CPU utilization on the host during these tests hovered around 5 percent. The equivalent tests using the 10Gbps cards, set to non-RDMA mode, resulted in more than 80 percent CPU utilization. With RDMA mode enabled, the 10Gbps cards achieved very similar results to the 40Gbps cards, simply because my two SSDs can't provide IO fast enough to saturate either network speed.
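If you want to reproduce this kind of test, a representative DiskSpd invocation looks something like the following; the test file path, size and read/write mix are just examples, not the exact parameters behind the numbers above:

```powershell
# 60 seconds of 8K random IO, 30 percent writes, 4 threads, 8 outstanding
# IOs per thread, with latency statistics (-L) and caching disabled (-Sh).
.\diskspd.exe -c10G -d60 -b8K -r -w30 -t4 -o8 -L -Sh D:\iotest.dat
```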
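For comparison runs, one way to take RDMA out of the equation at the OS level, and to watch CPU load and RDMA traffic while a test runs, is sketched below; the adapter names are placeholders, and the RDMA Activity counters only appear on RDMA-capable NICs:

```powershell
# Fall back to plain TCP for the non-RDMA comparison, then re-enable.
Disable-NetAdapterRdma -Name "10G Port 1", "10G Port 2"
Enable-NetAdapterRdma  -Name "10G Port 1", "10G Port 2"

# Watch overall CPU load and RDMA throughput during a test run.
Get-Counter -Counter '\Processor(_Total)\% Processor Time',
                     '\RDMA Activity(*)\RDMA Inbound Bytes/sec' -Continuous
```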
To saturate the 80Gbps link I would have needed something like 160 SSD drives. One tip: if you're doing performance testing on your own Storage Spaces setup, don't use file copy; the results won't be representative of the kind of performance you'll get with Hyper-V and SQL Server workloads.
We Have a Winner
These Chelsio NICs have been a joy to work with. Technical support, when I needed clarification of configuration settings, was swift and accurate, and the NICs are easy to install and configure. They're also cost-effective, so if you're considering a Storage Spaces or Storage Spaces Direct implementation, you owe it to yourself to put Chelsio RDMA technology on your short list.