Large Language Model Selection: Why the Parameter Count Isn't Everything
        
        
        
        
When choosing a large language model (LLM) for a particular task, one of the first things that people often look at is the model's parameter count. A vendor might offer several versions of a particular model, each with a different parameter count. The natural assumption is that the higher the parameter count, the better the model. After all, a model with 100 billion parameters just has to be better than a 7 billion parameter model, right? Unfortunately, things aren't always quite that simple.
  
While the parameter count often reflects a model's capabilities, the parameter count alone does not guarantee that one model is better than another. To put this into perspective, consider that a large orchestra does not automatically play better than a smaller one. It's entirely possible that the smaller orchestra has recruited better musicians than the larger one. Similarly, it's possible that an LLM with a smaller parameter count might do a better job than a model with a larger parameter count. It ultimately comes down to factors such as architecture, training, and data quality. Sure, having a large number of parameters can be a good thing, but the overall parameter count is meaningless if the model was trained on bad data.
  
In order to fully appreciate the role that parameters play in an LLM, it's necessary to understand what a parameter is and what it does for the model. Parameters are sometimes referred to as "knobs," "settings," or "variables." They are the internal values that the model adjusts during training in order to fit the data. To use an analogy from everyday life, you can think of parameters as being like the ingredients that a chef uses while preparing a dish. A chef might make a simple dish with just a few ingredients, or they could instead make something really complex that involves a lot of ingredients. Of course, using a lot of ingredients does not automatically make a dish better.
  
It's also worth noting that there are different types of parameters. Two of the most common parameter types are weights and biases. Weights determine how strongly one piece of information influences another. Going back to the cooking analogy, adding a dash of ground Carolina Reaper (one of the world's hottest peppers) to a dish would have a far bigger impact on the dish's overall taste than adding a dash of a mild seasoning such as coriander.
  
In an LLM, biases provide baseline offsets: values that shift a computation's output regardless of the input. To continue with the cooking analogy, a bias could be thought of as what the dish tastes like before you add any seasoning at all.
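To make this less abstract, here is a minimal sketch (assuming the PyTorch library, used purely for illustration) showing that a layer's weights and biases are just arrays of adjustable numbers, and that a model's parameter count is simply the total number of those values:

    import torch
    from torch import nn

    # A single linear layer computes y = Wx + b, where the matrix W holds
    # the weights and the vector b holds the biases.
    layer = nn.Linear(in_features=4, out_features=3)

    print(layer.weight.shape)  # torch.Size([3, 4]) -> 12 weight values
    print(layer.bias.shape)    # torch.Size([3])    ->  3 bias values

    # A model's "parameter count" is just the total number of these
    # adjustable values across every layer. Here it's 15; in an LLM,
    # it's billions.
    print(sum(p.numel() for p in layer.parameters()))  # 15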
  
Another important consideration when selecting an LLM is the number of active parameters. Often, a model's parameter count is specified using an abbreviation such as 70B, which reflects the use of 70 billion parameters. Occasionally, however, you might see a model's parameter count listed with two numbers. A name such as 128x17B typically describes a mixture-of-experts model: the first number refers to the number of expert subnetworks that the model contains, while the second indicates that roughly 17 billion parameters are active at any given time. Because only a few of those experts are engaged for any given token, the model's total parameter count is far larger than its active parameter count.
  
Although relying on a subset of a model's parameters rather than using all of them probably sounds like a bad thing, not every prompt needs the full model. Imagine that a large company with thousands of employees needs to complete a particular project. That company probably isn't going to engage every single employee. Instead, it will leverage the skills of the employees who are best suited to the task at hand. Mixture-of-experts models work the same way: a small routing component examines each token and sends it to the experts that are best equipped to handle it.
  
Better still, activating only a subset of the available parameters tends to allow the model to respond more quickly to prompts, because far less computation is performed for each token. Keep in mind, though, that the model's full set of parameters generally still has to be loaded into memory; it's the per-token compute that shrinks.
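To illustrate how this works, the toy sketch below (a hypothetical example written with PyTorch, not any vendor's actual implementation) builds a simple mixture-of-experts layer. A small gating network scores the experts for each input, and only the two top-scoring experts actually run:

    import torch
    from torch import nn

    class ToyMoE(nn.Module):
        """A toy mixture-of-experts layer, for illustration only."""

        def __init__(self, dim=16, num_experts=8, top_k=2):
            super().__init__()
            # Each "expert" is its own small network with its own parameters.
            self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
            # The gate scores how well suited each expert is to a given input.
            self.gate = nn.Linear(dim, num_experts)
            self.top_k = top_k

        def forward(self, x):
            scores = self.gate(x)                              # (batch, num_experts)
            weights, picked = scores.topk(self.top_k, dim=-1)  # best experts per input
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for i, expert_id in enumerate(picked[:, slot]):
                    # Only the chosen experts' parameters are active here.
                    out[i] += weights[i, slot] * self.experts[int(expert_id)](x[i])
            return out

    moe = ToyMoE()
    out = moe(torch.randn(4, 16))  # route a small batch through the layer

    total = sum(p.numel() for p in moe.parameters())
    per_expert = sum(p.numel() for p in moe.experts[0].parameters())
    gate = sum(p.numel() for p in moe.gate.parameters())
    print(f"total: {total}, active per input: ~{gate + 2 * per_expert}")

Even in this tiny example, less than a third of the parameters participate in any one computation, which is exactly why active parameter counts are so much smaller than total counts in real models.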
  
Using a subset of the available parameters can also help to avoid a problem known as overfitting. Overfitting occurs when a model becomes so closely fitted to its training data that it has trouble providing good answers to prompts it hasn't seen before. You can think of overfitting as being kind of like a student who memorizes an entire textbook word for word but doesn't actually understand any of it. If someone were to ask that student about something that is discussed in the textbook, the student could indeed recite the answer. However, if that same student were asked to make an inference based on something they read, they would have trouble doing so, because they don't understand the material.
  
Overfitting works in roughly the same way. The model doesn't just learn the training data; it also learns idiosyncrasies such as spelling errors or a particular author's writing style. When those idiosyncrasies are absent from end user prompts, the model doesn't quite know how to respond and tends to generate wrong answers. To put it another way, an overfitted model has memorized the training data too well: it becomes good at reciting that data, but bad at generalizing from it. Limiting how much of the model is active for any given input acts as a form of regularization, which helps to guard against this problem.
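Overfitting is easy to demonstrate outside of LLMs. The short sketch below (using NumPy, with made-up data) fits the same noisy points with a simple model and a wildly flexible one. The flexible model matches the training points almost perfectly but typically does worse on points it has never seen:

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up data: a simple underlying trend plus a little noise.
    x_train = np.linspace(0, 1, 10)
    y_train = 2 * x_train + rng.normal(scale=0.1, size=x_train.size)
    x_test = np.linspace(0, 1, 100)
    y_test = 2 * x_test + rng.normal(scale=0.1, size=x_test.size)

    for degree in (1, 9):
        coeffs = np.polyfit(x_train, y_train, degree)  # fit a polynomial
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train error {train_err:.4f}, "
              f"test error {test_err:.4f}")

    # The degree-9 polynomial "memorizes" the 10 training points (near-zero
    # training error) but generally does worse on the unseen test points.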
  
  Since the number of parameters associated with an LLM is not necessarily a reflection of how well the model will perform, what should you consider when selecting a model? 
  
Certainly, it's worth at least considering the parameter count, but there are a few other things to weigh. First, how well does the model align with the task that you will be using it for? Some models are intended for general use cases, while others are built for very specific ones.
  
  Another consideration is the model's benchmarks. Benchmarks provide standard metrics for comparing one large language model to another in terms of performance and accuracy.
  
Finally, you should also pay attention to the model's context size. The context size refers to the number of tokens that the model can take into account at once. Models with a larger context size can handle lengthier prompts and longer conversation histories than models with a smaller one.
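Because context size is measured in tokens rather than words or characters, it helps to be able to count tokens. The sketch below uses OpenAI's open source tiktoken library as one example; other model families ship their own tokenizers, and the exact counts vary between them:

    import tiktoken  # pip install tiktoken

    # One commonly used tokenizer; other models tokenize text differently.
    enc = tiktoken.get_encoding("cl100k_base")

    prompt = "Why isn't a model's parameter count the whole story?"
    tokens = enc.encode(prompt)
    print(len(tokens), "tokens")

    # Before sending a long prompt, compare its token count against the
    # model's context size to make sure it fits (leaving room for a reply).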
  
About the Author

Brien Posey is a 22-time Microsoft MVP with decades of IT experience. As a freelance writer, Posey has written thousands of articles and contributed to several dozen books on a wide variety of IT topics. Prior to going freelance, Posey was a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the country's largest insurance companies and for the Department of Defense at Fort Knox. In addition to his continued work in IT, Posey has spent the last several years actively training as a commercial scientist-astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space. You can follow his spaceflight training on his Web site.