In-Depth
What GPU You Really Need for AI Workloads
Selecting the right GPU is key to running an AI workload efficiently, but with so many different GPUs on the market, how do you know which one best fits your workload's requirements?
As a general rule, memory is one of the most important considerations when selecting a GPU. In some ways, this idea seems counterintuitive. After all, GPU vendors usually focus their marketing messages on GPU performance, not memory. However, GPU performance only matters if you can load the model onto the GPU in the first place, and your ability to do that depends directly on how much GPU memory is available. To put it another way, memory is the hard limit that determines what you will or will not be able to do with your GPU.
So how much GPU memory does your workload actually need? To answer that question, you need to know which parts of the workload consume VRAM. Obviously, every workload is different, so I am going to stick to some of the more commonly used AI components. To keep the discussion manageable, I am also going to limit it to inference workloads. AI training is beyond the scope of this blog post.
The first thing that VRAM is used for is model weights. These weights must be loaded into VRAM before any inferences can begin. The amount of VRAM consumed by the model weights depends on the number of parameters used in the model and the model's numerical precision.
Parameters are the variables within the model that control the model's output. Parameters have occasionally been compared to a sound mixer: a mixer has knobs that control bass, treble, reverb, and that sort of thing. In an AI model, adjusting the parameters, like turning those knobs, changes the output.
The numerical precision is exactly what it sounds like: it determines how precisely numbers are expressed within the model. Consider, as an example, that the value of pi can be expressed as 3.14. It can also be expressed as 3.14159265. Both values are correct, but the second one is more precise.
AI models that use a lower level of precision require less memory and are computationally faster than models with a higher level of precision. However, a lower precision model typically cannot capture as much detail in its output as a higher precision model can.
Model precision levels are expressed as FP (floating point, or decimal) or INT (integer), followed by a number that indicates how many bits are used to store a single value. For example, a model with FP16 precision uses 16 bits, or 2 bytes, to store each value. Hence, if the model has 7 billion parameters and uses FP16 precision, it will require roughly 14 GB just for the weights.
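If you want to sanity check that math yourself, the calculation is simple enough to sketch in a few lines of Python. The parameter count and precision values here are illustrative assumptions, not figures for any specific model:

```python
# Rough sketch: VRAM needed just to hold the model weights.
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_memory_gb(num_parameters: float, precision: str) -> float:
    """Approximate memory (in GB) consumed by the model weights alone."""
    return num_parameters * BYTES_PER_VALUE[precision] / 1e9

# A hypothetical 7-billion-parameter model stored at FP16 precision:
print(f"{weight_memory_gb(7e9, 'FP16'):.1f} GB")  # 14.0 GB
```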
This is where the concept of quantization comes into play. Quantization is a technique that makes it possible to run a large model on a GPU that would not otherwise have enough VRAM to hold the model's weights. It works by storing numbers at a lower precision, thereby shrinking the weights. The trade-off is that some models become less accurate or produce lower-quality output as a result of quantization.
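To see why quantization matters, here is the same back-of-the-envelope math applied to a hypothetical 7-billion-parameter model at several common precision levels. The sizes are illustrative, not benchmarks for any particular model:

```python
# The same hypothetical 7-billion-parameter model at different precisions.
params = 7e9
for precision, bytes_per_value in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{precision}: {params * bytes_per_value / 1e9:.1f} GB of weights")
# FP32: 28.0 GB | FP16: 14.0 GB | INT8: 7.0 GB | INT4: 3.5 GB
```

Quantizing that model from FP16 to INT4 cuts the weight footprint from 14 GB to 3.5 GB, which is the difference between needing a data center GPU and fitting on a modest consumer card.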
AI models also consume VRAM as a result of activations. You can think of activations as a scratch pad that the model uses to "write down" numbers while it is "thinking." Activations are temporary in nature and stay in memory only for as long as it takes the model to process the input. Even so, VRAM is required to store them.
Activations are typically smaller than the model weights, but they aren't zero. Activation size is determined by factors such as the input size (longer text or bigger images result in larger activations) and the batch size (asking multiple questions at once).
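There is no single formula that covers every architecture, but as a very rough, order-of-magnitude sketch, peak activation memory for a transformer during inference scales with batch size, sequence length, and the model's hidden size. Every value below, including the hidden size and the fudge factor standing in for the intermediate tensors that are alive at the same time, is an assumption rather than a measured figure:

```python
# Very rough sketch of peak activation memory during inference.
# Real numbers depend heavily on the architecture and the inference engine.
def activation_memory_gb(batch_size: int, seq_len: int, hidden_size: int,
                         bytes_per_value: int = 2, fudge_factor: int = 4) -> float:
    """Activations grow with batch size, input length, and hidden size.
    fudge_factor loosely accounts for the several intermediate tensors
    (attention scores, MLP expansion, etc.) held in memory at once."""
    return batch_size * seq_len * hidden_size * bytes_per_value * fudge_factor / 1e9

# One 4,096-token request against a hypothetical 7B model (hidden size 4,096):
print(f"{activation_memory_gb(1, 4096, 4096):.2f} GB")  # ~0.13 GB
```

Even under these generous assumptions, the activations for a single request come out to a small fraction of the 14 GB of weights, which is why they are easy to overlook, right up until you raise the batch size.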
Another component that consumes VRAM is the KV cache. The KV cache's job is to keep track of tokens the model has already processed. Imagine for a moment that you have a large language model powering an AI chatbot, and that this chatbot is able to remember its interactions with you over the course of a conversation. The KV cache is a shortcut that lets the model remember what was said without having to reprocess all of the earlier tokens from scratch.
Some people have compared the KV cache to sticky notes. Imagine that someone leaves you a voicemail containing a phone number other than the number they called from. When it comes time to reference that phone number, you could replay the voicemail. Alternatively, you could just write the number down on a sticky note so that you don't have to replay the message. After all, it's probably faster to glance at the sticky note than to replay the message. That is similar to what the KV cache does.
The KV cache size increases based on the length of the conversation. The greater the number of tokens or the larger the batch size, the larger the KV cache will become. In extreme situations, the KV cache can theoretically grow to be larger than the model weights.
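For transformer models, the KV cache can be estimated from the model's architecture. The layer count, head count, and head dimension below are assumptions loosely patterned on a typical 7-billion-parameter model; check your model's configuration for the real numbers:

```python
# Estimating KV cache size: two tensors (keys and values) are cached per
# layer for every token in every sequence in the batch.
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_value: int = 2) -> float:
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value) / 1e9

# A single 4,096-token conversation with an FP16 cache:
print(f"{kv_cache_gb(32, 32, 128, 4096, 1):.1f} GB")   # ~2.1 GB

# The same model serving 16 concurrent 4,096-token conversations:
print(f"{kv_cache_gb(32, 32, 128, 4096, 16):.1f} GB")  # ~34 GB
```

Notice that at a batch size of 16, the cache in this sketch outgrows the 14 GB of weights, which is exactly the extreme situation described above.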
Not surprisingly, one of the biggest mistakes made when estimating the GPU memory needed for AI workloads is assuming that the model size alone defines the GPU requirements. Unfortunately, determining the VRAM requirements isn't as simple as saying, "the model is 16 GB in size, so I need 16 GB of VRAM." In reality, the VRAM requirement is the model weights, KV cache, activations, and overhead all added together. Overhead accounts for things like kernel workspaces, memory alignment, and CUDA buffers, and depending on the model it may be in the range of 5% to 20%. When all is said and done, the VRAM requirements are often 1.2 to 1.5 times the model size, though this varies from one model to the next. Keep in mind that we are limiting the discussion to inference workloads; training workloads often require VRAM that is 6 to 8 times the size of the model.
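Pulling the pieces together, a hedged estimate might look something like this. All of the inputs are illustrative, and the 10% overhead figure is just one point in the 5% to 20% range mentioned above:

```python
# Back-of-the-envelope total VRAM estimate for an inference workload.
def total_vram_gb(weights_gb: float, kv_cache_gb: float,
                  activations_gb: float, overhead_fraction: float = 0.10) -> float:
    """Total VRAM = weights + KV cache + activations, plus 5-20% overhead
    for kernel workspaces, memory alignment, and CUDA buffers."""
    subtotal = weights_gb + kv_cache_gb + activations_gb
    return subtotal * (1 + overhead_fraction)

# Hypothetical 7B FP16 model serving one 4,096-token conversation:
print(f"{total_vram_gb(weights_gb=14.0, kv_cache_gb=2.1, activations_gb=0.2):.1f} GB")
# ~17.9 GB
```

In this sketch, the total works out to roughly 1.3 times the size of the weights, which lands squarely in the 1.2 to 1.5 range noted above.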
About the Author
Brien Posey is a 22-time Microsoft MVP with decades of IT experience. As a freelance writer, Posey has written thousands of articles and contributed to several dozen books on a wide variety of IT topics. Prior to going freelance, Posey was a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the country's largest insurance companies and for the Department of Defense at Fort Knox. In addition to his continued work in IT, Posey has spent the last several years actively training as a commercial scientist-astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space. You can follow his spaceflight training on his Web site.