Using Speculative Decoding to Improve Chatbot Performance
Like other enterprise workloads, AI applications must be designed with scalability and performance in mind. After all, it's one thing to build a chatbot, but it's quite another thing to create a chatbot that can service hundreds or thousands of concurrent requests without overburdening the underlying hardware. In an effort to help AI chatbots service more demanding workloads with less overall hardware impact, organizations are adopting a technique called speculative decoding.
Speculative decoding is one of those things that works, but is completely counterintuitive on so many levels, starting with the idea that a chatbot can consume fewer hardware resources by leveraging multiple AI models. By definition, each additional model that is used places an additional load on the system. However, speculative decoding uses its models in a way that can significantly decrease the overall load on the hardware, making it possible to increase throughput without adding hardware.
Here is the basic idea: Smaller models are faster and less demanding than larger models, but larger models are generally more capable. As such, an AI application can use a small model to generate a response to a prompt and then use a larger model to "fact check" the smaller model and to fill in any knowledge gaps. Technically, the larger model is not fact checking in the human sense of the phrase, but fact checking is a good analogy.
So now that I have given you a general -- albeit not technically correct -- idea of what's going on, let's take a look at how speculative decoding actually works and why it is so often able to decrease the hardware workload.
How Chatbots Generate Tokens
The first thing that you have to understand is that an AI chatbot does not know information in the sense that we as humans do. As an example, a chatbot might be able to tell you that Ottawa is the capital of Canada, but it does not actually "know" the Canadian capital. Instead, it uses a complex system of probabilities to figure it out on the fly.
AI chatbots work by using probabilities to predict the next token within the output. Think of this as being like predicting the next word in a sentence, although in reality there is not a one-to-one relationship between tokens and words. A normal chatbot that is based on a single model predicts exactly one token each time the model runs. It then has to run the entire neural network again to predict the next token in the series. If the output is 500 tokens in length -- about 375 words -- then the neural network will have to run 500 times. Needless to say, this is expensive from a hardware standpoint.
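To make the one-token-per-run idea concrete, here is a minimal Python sketch of an ordinary generation loop. The toy_model function is a hypothetical stand-in for a real neural network (it simply invents a token name), and the generate helper is my own illustration rather than any particular library's API. The only point is to count how many times the "model" has to run.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a real model: given the tokens so far, return the next one.
NextTokenFn = Callable[[List[str]], str]


def generate(next_token: NextTokenFn, prompt: List[str],
             max_new_tokens: int) -> Tuple[List[str], int]:
    """Generate tokens one at a time and count the model runs (forward passes)."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        # Each new token requires running the whole model on the text so far.
        tokens.append(next_token(tokens))
        forward_passes += 1
    return tokens[len(prompt):], forward_passes


def toy_model(tokens: List[str]) -> str:
    # Assumption: a fake "model" that just names the token after its position.
    return f"tok{len(tokens)}"


prompt = ["What", "is", "the", "capital", "of", "Canada", "?"]
output, passes = generate(toy_model, prompt, max_new_tokens=500)
print(f"{len(output)} tokens generated in {passes} model runs")
```

In this sketch the number of model runs always equals the number of tokens generated, which is exactly the cost that speculative decoding tries to reduce.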
But what if the AI could predict multiple tokens instead of just one? That would dramatically reduce the number of times that the model has to run, thereby lessening the demand on the hardware. The model can't do that exactly, but the large model can validate several proposed tokens in a single forward pass rather than generating each of those tokens individually. This is the entire basis of speculative decoding.
How Speculative Decoding Works
Remember what I said about smaller models being faster and less demanding? What if you could use a small model to generate the tokens and then use a large model to validate those tokens? Here's how it works:
The process begins with a user entering a prompt. That prompt is sent to the large model and the large model is instructed to generate a single token. That token and the prompt itself are then handed off to the small model. The small model then generates the next N tokens. Those tokens are then sent back to the larger model, which looks at the tokens and determines whether those are the same tokens that it would have generated. If so, then the tokens are accepted. If not, then at least some of the tokens are rejected.
This brings up a really important point. The act of accepting or rejecting tokens is not all or nothing. Let's pretend for a moment that a small model returns 10 tokens and the first seven tokens match what the large model would have produced, but the eighth token is a mismatch. In a situation like this, the large model would keep the first seven tokens and discard tokens eight, nine and 10. It would then generate its own "token eight," which it would then pass to the small model, thereby continuing the process.
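The partial acceptance rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any specific framework's implementation: the accept_prefix function and the sample tokens are hypothetical, and I am assuming that the large model's preferred token for each drafted position (plus one extra position) is already available from its verification pass, which is how many implementations obtain the replacement token.

```python
from typing import List, Tuple


def accept_prefix(draft: List[str], target: List[str]) -> Tuple[List[str], str]:
    """Keep draft tokens up to the first mismatch, then take the large model's token.

    `target` is assumed to hold the large model's choice for every drafted
    position plus one more (len(draft) + 1 entries), so a replacement token
    (or a bonus token, if everything matched) is always available.
    """
    accepted: List[str] = []
    for drafted, wanted in zip(draft, target):
        if drafted != wanted:
            break  # first mismatch: everything from here on is discarded
        accepted.append(drafted)
    next_token = target[len(accepted)]
    return accepted, next_token


# Ten drafted tokens; the first seven match and the eighth does not.
draft = ["Ottawa", "is", "the", "capital", "city", "of", "Canada",
         "and", "has", "been"]
# The large model's entries after the mismatch are never used.
target = draft[:7] + [".", "it", "a", "since"]

kept, replacement = accept_prefix(draft, target)
print(len(kept), "tokens kept; large model supplies:", repr(replacement))
```

Here the first seven drafted tokens are kept, tokens eight through 10 are thrown away, and the large model's own eighth token becomes the starting point for the next round of drafting.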
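The important thing to understand about this sequence is that the end result is identical to what the large model would have produced by itself. Remember, for every token that the small model produces, the large model checks to see if the token matches what it would have produced, and any token that doesn't match is rejected and replaced with the large model's own choice. Strictly speaking, this exact-match guarantee applies when the large model always picks its most likely token (so-called greedy decoding). Implementations that sample from the model's probabilities use a probabilistic acceptance rule instead, which is designed to preserve the large model's output distribution rather than reproduce a token-for-token match.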
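The claim that the output does not change can be checked with a toy simulation. The Python sketch below is a simplified illustration and makes several assumptions: both "models" are hypothetical functions that look up the next token by position, the small model is deliberately wrong on the two knowledge-heavy words, decoding is greedy, and a verification step is counted as a single large-model pass even though the toy code calls the function once per position (a real transformer would score all of those positions in one forward pass).

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[str]], str]

PROMPT = ["<question>"]
REFERENCE = ("the capital of canada is ottawa and it sits on "
             "the ottawa river in ontario").split()


def large_model(tokens: List[str]) -> str:
    # Hypothetical large model: always produces the reference answer.
    pos = len(tokens) - len(PROMPT)
    return REFERENCE[pos] if pos < len(REFERENCE) else "<eos>"


def small_model(tokens: List[str]) -> str:
    # Hypothetical small model: handles filler words but misses place names.
    token = large_model(tokens)
    return "<unk>" if token in ("ottawa", "ontario") else token


def greedy_decode(model: NextTokenFn, prompt: List[str], n: int) -> List[str]:
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens[len(prompt):]


def speculative_decode(target: NextTokenFn, draft: NextTokenFn,
                       prompt: List[str], n: int, k: int = 4) -> Tuple[List[str], int]:
    tokens = list(prompt)
    tokens.append(target(tokens))  # the large model produces the first token
    target_passes = 1
    while len(tokens) - len(prompt) < n:
        # The small model drafts k tokens, one at a time.
        drafted: List[str] = []
        for _ in range(k):
            drafted.append(draft(tokens + drafted))
        # One simulated verification pass over k + 1 positions.
        target_passes += 1
        wanted = [target(tokens + drafted[:i]) for i in range(k + 1)]
        # Keep the longest matching prefix, then the large model's own token.
        accepted = 0
        while accepted < k and drafted[accepted] == wanted[accepted]:
            accepted += 1
        tokens.extend(drafted[:accepted])
        tokens.append(wanted[accepted])
    return tokens[len(prompt):len(prompt) + n], target_passes


baseline = greedy_decode(large_model, PROMPT, len(REFERENCE))
spec_output, large_passes = speculative_decode(large_model, small_model,
                                               PROMPT, len(REFERENCE))
print("identical output:", spec_output == baseline)
print("large-model passes:", large_passes, "versus", len(baseline))
```

In this toy run the speculative output matches the large model's own output token for token, while the large model is invoked far fewer times than once per token. Real acceptance rates depend on how closely the two models agree, so the savings here should not be read as a benchmark.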
This raises the question of how this process is computationally more efficient than just letting the big model handle the prompt in the usual way.
Why the Technique Can Reduce Hardware Demand
The answer stems from the fact that the smaller model is much faster and more efficient than the large model. If the user is asking something simple such as, "What is the capital of Canada?" then the small model can easily answer that on its own. Yes, the large model validates the output, but in this example it is not generating any replacement tokens.
But what if the user's prompt asks about something complicated like rocket science or biochemistry? After all, a small model probably does not contain knowledge of advanced sciences. Even so, speculative decoding can still help.
Remember what I said earlier. A model does not "know" the material as humans do. It works by using probabilities to predict the next token in the sequence. In the English language, there are a lot of noise words such as the, at, in, a and an. These noise words make up a significant percentage of even the most complex scientific paper. So even if the small model knows nothing about biochemistry or rocket science, it can still help to fill in the gaps between the more important words.
Where Speculative Decoding Falls Short
Speculative decoding can potentially boost performance and decrease hardware demand for most AI chatbot workloads, but there is one big exception. If the chatbot is designed for creative writing or poetry, it is going to be very difficult for the small model to come up with the same tokens as the large model does. As such, the vast majority of tokens will probably be rejected. In this worst-case scenario, the speculative decoding mechanism fails to deliver any improvement in speed or efficiency, and it can even be slightly slower because the effort spent drafting tokens is wasted. On the bright side, however, the large model is effectively doing all the work in this situation, so the quality of the output is not degraded.
At first glance, it might seem as though speculative decoding makes the large model do twice as much work by verifying the draft. In reality, transformers are highly parallel. Once the smaller model proposes several tokens, the larger model can evaluate all of those positions during a single forward pass. That is much cheaper and faster than generating those same tokens one at a time, which is why speculative decoding can improve throughput even though two models are involved.
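A rough back-of-the-envelope model shows why the acceptance rate matters so much. The Python sketch below assumes, purely for illustration, that every drafted token is accepted independently with the same probability. Under that simplifying assumption, a draft of k tokens yields 1 + a + a^2 + ... + a^k tokens per large-model pass on average, where a is the acceptance rate. The function name and the sample rates are my own, and real workloads will not follow this idealized model exactly.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Average tokens produced per large-model verification pass.

    Assumes each drafted token is accepted independently with probability
    `accept_rate`. The extra term (i = 0) is the token that the large model
    itself contributes on every pass.
    """
    return sum(accept_rate ** i for i in range(draft_len + 1))


# Hypothetical acceptance rates, from a poor match to a very good one.
for rate in (0.1, 0.5, 0.8, 0.95):
    tokens = expected_tokens_per_pass(rate, draft_len=4)
    print(f"acceptance rate {rate:.2f}: about {tokens:.2f} tokens per large-model pass")
```

When the acceptance rate is close to zero, as it might be for poetry, the result approaches one token per pass, which is no better than running the large model alone. As the rate climbs toward one, each large-model pass yields close to k + 1 tokens.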
About the Author
Brien Posey is a 22-time Microsoft MVP with decades of IT experience. As a freelance writer, Posey has written thousands of articles and contributed to several dozen books on a wide variety of IT topics. Prior to going freelance, Posey was a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the country's largest insurance companies and for the Department of Defense at Fort Knox. In addition to his continued work in IT, Posey has spent the last several years actively training as a commercial scientist-astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space. You can follow his spaceflight training on his Web site.