The Cranky Admin

The Next IT Frontier: Adaptive Orchestration

Getting a handle on skyrocketing workloads.

Distributed computing is hard, large clusters are hard, parallel computing is really hard, and Gustafson's law is a pain in the ASCII. With containers, we're already at the point that we can stuff tens of thousands of workloads into 10 rack units of space, and to be brutally honest, we're not very good at coordinating that. What happens when tens of thousands becomes hundreds of thousands, or millions, of workloads?

Today's infrastructure is experiencing a generational leap in capability. Breakneck speeds bring their own problems to datacenter design, which ensures, among other things, that we'll always need systems integrators. Hardware Compatibility Lists are kind of crap, which makes rolling our own next-generation software-defined win machine somewhat problematic. On top of it all, increased workload density is both blessing and curse.

After more than a decade of relatively steady -- bordering on stagnant -- increases in systems capability, everything is coming to a head. The last time we ran across a change of this magnitude we were consolidating workloads from bare metal using virtualization, and driving performance with increasingly fancy centralized storage. We had to change everything about how we managed workloads then. We're going to have to now as well.

The Lethargy of Volume
At the turn of the millennium we had to dramatically rethink storage because of virtualization. Not only did shared storage enable critical functionality such as vMotion and High Availability (HA), but we started cramming rather a lot of workloads onto a limited number of storage devices. The SCSI RAID cards we jammed into each host just wouldn't cut it as centralized storage solutions.

Fibre channel SANs got lots of love; the gigabit Ethernet of the day was just not up to the task. Demand on storage and networking both increased. Fibre channel speeds increased every few years. Gigabit Ethernet begat 10 Gigabit Ethernet (10GbE). Eventually, 40GbE and 100GbE were also born.

10GbE saw reasonably widespread adoption, but 40GbE and 100GbE didn't see much love. Prices were too high. That was okay; storage and networking didn't impinge upon one another, and most of us weren't really stressing our networks.

Eventually, fibre channel vendors got greedy. The push to do storage over Ethernet became serious, made all the more so by the emergence of software-defined storage (SDS) and hyperconvergence. Ethernet switches evolved to solve latency and microbursting issues, and The Great Jumbo Frames Debate almost became a thing.

Just as it looked like we might have to either start actually caring about jumbo frames or kowtow to the fibre channel mafia, a new networking standard with 25GbE, 50GbE and a new 100GbE emerged. Significantly cheaper than its predecessors, it will let us stave off efficiencies like jumbo frames for a few more years.

For the longest time, the storage side wasn't much different. Fibre channel doubled in speed every now and again, but the bottleneck was in the box. IDE to SATA, SCSI to SAS; iteration after iteration slowly allowed for faster systems. Hard drives never really got much faster; for years we just kept adding shelves full of disks, in a desperate attempt to solve the performance problems brought about by continuing consolidation and the ceaseless demand for more workloads.

Then along came flash.

Software-Defined Transformation
Flash is too fast for SATA, and it's too fast for SAS. We didn't really notice this in the beginning because in addition to using standards that couldn't take advantage of flash disks, the RAID controllers and HBAs of the time had such low queue depths they couldn't even make full use of the standards of the time. This was around 2010. Hyperconverged vendors yelled at storage controller vendors, and things slowly started to suck less.

NVMe came out. We hooked flash up directly to the PCIe bus. This broke our SDS solutions because we couldn't get enough networking in. We lashed 10GbE ports together, and eventually 100GbE ports. Containerization went mainstream, and suddenly 150 workloads per box became 2,500.

Snowden happened. American judges started making extraterritorial demands for data. Brexit. Trump. The fig leaf of Safe Harbor was blown away and the EU's General Data Protection Regulation (GDPR) demanded both privacy and security by design and by default.

Now we not only needed to store and move and run everything at the power of raw ridiculosity, we needed to encrypt it, too. Table stakes became encryption at rest and encryption in flight, with realtime deduplication and compression to boot. The slow, steady pace of the early 21st century wasn't -- isn't -- enough.

That jumbo frames debate will be back soon enough. RDMA, NVMe and other technologies that deliver better throughput and/or latency through marked improvements in efficiency, without needing new standards, are about to be the default, not some add-on used only by the dark priests of niche datacenters.
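To put a rough number on the kind of efficiency at stake, here is a back-of-the-envelope sketch of how much of each Ethernet frame is actual payload at the standard 1,500-byte MTU versus a 9,000-byte jumbo MTU. It assumes plain IPv4 and TCP headers with no options and no VLAN tag; the function names are made up for illustration.

```python
# Per-frame Ethernet overhead on the wire, in bytes:
# preamble (7) + start-of-frame delimiter (1) + header (14)
# + frame check sequence (4) + inter-frame gap (12).
ETHERNET_OVERHEAD = 7 + 1 + 14 + 4 + 12  # 38 bytes, assuming no VLAN tag
IP_TCP_HEADERS = 20 + 20                 # assuming IPv4 + TCP, no options


def wire_efficiency(mtu: int) -> float:
    """Fraction of on-the-wire bytes that carry application payload."""
    payload = mtu - IP_TCP_HEADERS
    return payload / (mtu + ETHERNET_OVERHEAD)


def frames_per_gigabyte(mtu: int) -> int:
    """Frames needed to move 10**9 bytes of payload (rounded up)."""
    payload = mtu - IP_TCP_HEADERS
    return -(-10**9 // payload)  # ceiling division


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, f"{wire_efficiency(mtu):.2%}", frames_per_gigabyte(mtu))
```

Under these assumptions the standard MTU comes out at roughly 94.9% payload and the jumbo MTU at roughly 99.1%. The bigger practical difference is the frame count: about six times fewer frames to process for the same data, which matters when every frame also has to be encrypted, deduplicated and compressed.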

A Game of Risk
The 00s were a slow, even boring march of sequential iterations in technology. This predictability had value. We knew what to expect, so we could make reasonable judgements about technology investments without taking a big risk that we'd be caught unaware by some massive leap in capability. Many of us even became comfortable stretching out beyond the vendors' preferred three-year refresh horizon.

This is no longer the case. When we stop solving problems by throwing raw throughput at them we have to learn something new. When we stopped simply ramping up the clock frequency of processors we had to learn to get good at multiple cores. Physics said no, and we adapted.

This time, it's not just CPUs that are hitting the wall. Today's equipment might not speak the more efficient protocols of tomorrow, and we need to start thinking about that. Today the focus is on building private clouds and trying to treat all infrastructure merely as a commodity to be consumed.

The next hurdle will be how to maintain those pools of resources when the older chunk of our cloud doesn't speak the same language as the newer bit. You wouldn't, for example, want your front-end application living on a network cluster that spoke jumbo frames and your database cluster on one that didn't.

The Next Big Thing is Adaptive Orchestration. This means automating workload interdependence identification and then enforcing locality based upon performance, latency, risk profile, resiliency requirements, recovery point objective (RPO) and recovery time objective (RTO), as well as regulatory, governance and data residency requirements.
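As a minimal sketch of what "enforcing locality" could look like, the snippet below places workloads onto clusters using three of the constraints named above: data residency, latency, and the requirement that a workload land on a cluster that speaks the same network dialect (jumbo frames or not) as the things it depends on. Every name here (Cluster, Workload, place) is hypothetical and invented for illustration. A real orchestrator would discover the dependencies automatically and weigh far more factors.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cluster:
    name: str
    region: str
    jumbo_frames: bool
    latency_ms: float  # typical latency offered by the cluster


@dataclass
class Workload:
    name: str
    allowed_regions: set[str]  # data residency constraint
    max_latency_ms: float      # performance constraint
    depends_on: list[str] = field(default_factory=list)


def eligible(workload: Workload, cluster: Cluster) -> bool:
    """Check a workload's own constraints against one cluster."""
    return (cluster.region in workload.allowed_regions
            and cluster.latency_ms <= workload.max_latency_ms)


def place(workloads: list[Workload], clusters: list[Cluster]) -> dict[str, str]:
    """Greedy placement; assumes dependencies are listed before dependents."""
    placed: dict[str, Cluster] = {}
    for w in workloads:
        candidates = [c for c in clusters if eligible(w, c)]
        # Keep only clusters whose jumbo frame setting matches every
        # already-placed dependency.
        for dep in w.depends_on:
            if dep in placed:
                candidates = [c for c in candidates
                              if c.jumbo_frames == placed[dep].jumbo_frames]
        if not candidates:
            raise ValueError(f"no eligible cluster for {w.name}")
        placed[w.name] = min(candidates, key=lambda c: c.latency_ms)
    return {name: cluster.name for name, cluster in placed.items()}
```

Given a database pinned by residency rules to a jumbo-frame cluster, a front end that depends on it follows it there, even if an older cluster without jumbo frames offers lower latency on paper. The greedy approach is the simplest thing that shows the idea; it makes no attempt at a globally optimal placement.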

Pets vs. cattle defined the transition from a few hundred to tens of thousands of workloads. Adaptive Orchestration will enable us to handle millions.

About the Author

Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.

