How-To

Running AI on a Raspberry Pi, Part 1: Overview

In a recent article, I looked at the Raspberry Pi 500+ and found it more than adequate for desktop use. That got me wondering whether, with its quad-core BCM2712 (2.4GHz Cortex-A76) processor, 16GB of RAM, and, perhaps most importantly, a 256GB NVMe drive, I could use it to run and learn more about AI. A quick Google search showed that others have used it for this purpose and that a few AI projects have adopted it as a platform. Since the Pi 500+ is based on the more popular Pi 5, I will use the two names interchangeably in this article.

That people are running AI locally on it surprised me, as I usually think of AI running on powerful computers in massive, power-hungry data centers. A little more research showed that edge AI (running models on smaller, low-powered systems) is already in use and growing in popularity.

Before diving into my attempts to run AI on my Pi, I want to use this article to set the stage on the terms and technologies used in AI.

Overview of LLMs and RAG on a Pi
The heart of AI is its use of Large Language Models (LLMs). LLMs, such as OpenAI's GPT, Google's PaLM, Qwen, and others, have revolutionized how computers understand and generate human language and, more importantly, how we do our work.

When paired with techniques like Retrieval-Augmented Generation (RAG), these LLMs are even more powerful, enabling the delivery of context-aware, accurate, and up-to-date information.

Recent advancements in hardware and model optimization have enabled running smaller versions of these technologies even on local devices such as a Raspberry Pi 5.

In the rest of this article, I will dig into what LLMs and RAG are, what they are used for, and how they can be utilized on small local computers.

Understanding LLMs
At their core, LLMs are artificial intelligence systems trained on massive text datasets to understand, generate, and manipulate human language. These models are built to enable them to understand context, word relationships, and subtle language nuances. Unlike traditional programs that follow explicit instructions, LLMs learn patterns from data, allowing them to (hopefully) generate coherent, contextually relevant answers to our questions or prompts.

Uses of LLMs
LLMs are quite versatile and can be, and are, used for many different purposes. They excel at Natural Language Understanding (NLU), enabling them to comprehend human input as text and, increasingly, as speech. This makes them effective for chatbots, virtual assistants, and automated customer service solutions.

But this is really just the tip of the iceberg of what they can do, as LLMs are now also being used to produce entire articles, code (see my article on Vibe coding), summaries, and even graphics (see examples sprinkled throughout this article).


They can and are used to translate languages or condense large documents into more readable forms.

Finally, in education, they serve as personal tutors, generate practice questions, and summarize complex papers. Google's NotebookLM and its open-source clones, Open Notebook and SurfSense, are great examples of AI tools that do this.

Tokens
Another term you may have heard in AI is 'tokens'. LLMs don't read words as humans do; instead, they process text in chunks called tokens. Interestingly, a token isn't always a whole word: it can be a single character, a punctuation mark, or a common sub-word (like "ing" or "pre"). The first step in an LLM pipeline is tokenization, which converts raw text into unique numerical IDs. Once converted to numbers, these tokens are turned into embeddings (also known as vectors) so the model can perform the complex math needed to predict the next token in a sequence.
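To make this concrete, here is a toy illustration of sub-word tokenization. The tiny hand-made vocabulary below is purely for demonstration; real tokenizers (such as the BPE tokenizers used by GPT-style models) learn vocabularies of tens of thousands of sub-words from data.

```python
# Toy sub-word tokenizer. The vocabulary is hand-made for illustration;
# real LLM tokenizers learn theirs from massive text corpora.
VOCAB = ["pre", "process", "ing", "token", "iz", "e", "s", " "]
TOKEN_IDS = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text):
    """Greedily match the longest known sub-word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for tok in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

tokens = tokenize("preprocessing tokenizes")
ids = [TOKEN_IDS[t] for t in tokens]
print(tokens)  # ['pre', 'process', 'ing', ' ', 'token', 'iz', 'e', 's']
print(ids)     # the numerical IDs the model actually works with
```

Note that "preprocessing" becomes three tokens, not one, which is exactly why token counts rarely equal word counts.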

You may see a "context limit" (e.g., 128k tokens), which refers to how many of these building blocks the model can "keep in its head" at once.

Keep in mind that many AI services charge based on the number of tokens they process. Also, how quickly a computer and a model can process tokens is a good gauge of their suitability for a specific purpose.
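A quick back-of-the-envelope calculation shows how token-based billing adds up. The per-token prices below are made-up placeholders, not any provider's actual rates:

```python
# Rough token cost estimate. These prices are hypothetical placeholders;
# check your provider's current pricing before relying on this.
PRICE_PER_1M_INPUT = 0.50   # USD per million input (prompt) tokens
PRICE_PER_1M_OUTPUT = 1.50  # USD per million output (response) tokens

def estimate_cost(input_tokens, output_tokens):
    return (input_tokens * PRICE_PER_1M_INPUT +
            output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

# e.g. a 2,000-token prompt that produces a 500-token reply:
print(f"${estimate_cost(2_000, 500):.6f}")  # $0.001750
```

Pennies per request, but multiplied across thousands of requests (or very long documents), it becomes a real line item, which is one motivation for running models locally.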


RAG: Retrieval-Augmented Generation
While LLMs are the backbone of AI, they do have limitations. One of the main issues is that their knowledge is fixed at the time of their creation (i.e., during training). This means an LLM trained on data up to 2024 won't know about events or publications that occurred after that. This is where Retrieval-Augmented Generation (RAG) comes into play. With RAG, we are no longer limited by the model's training cutoff, as RAG lets the model retrieve and use external, up-to-date data sources.


The process generally involves three steps: the LLM formulates a query from the user's input, relevant documents or data are retrieved from a knowledge base, and the LLM generates a response using the retrieved information combined with its own knowledge.
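The retrieval half of that pipeline can be sketched in a few lines. This is a deliberately minimal stand-in: it uses bag-of-words vectors and cosine similarity in place of a real embedding model, and the sample documents are invented for the example.

```python
import math
from collections import Counter

# Minimal sketch of RAG retrieval: word-count vectors stand in for
# real embeddings, and cosine similarity picks the best document.
DOCS = [
    "The Raspberry Pi 5 has a quad-core BCM2712 CPU.",
    "RAG combines retrieval with text generation.",
    "Tokens are the chunks of text an LLM processes.",
]

def embed(text):
    """Turn text into a word-count vector (a crude stand-in for embeddings)."""
    return Counter(text.lower().replace(".", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, docs):
    """Return the document most similar to the query."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

context = retrieve("what cpu does the raspberry pi use", DOCS)
prompt = f"Answer using this context:\n{context}\n\nQuestion: what CPU does the Pi use?"
print(prompt)  # this augmented prompt would then be sent to the LLM
```

The final step, sending the augmented prompt to an LLM, is exactly what the rest of this series will set up on the Pi.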

Applications for RAG
RAG shines where current or domain-specific information is essential. For example, in customer support, it can provide answers by retrieving data from a company's product manuals or knowledge base. We are even seeing it in the medical field, where doctors and patients can query medical databases for specific information, with the LLM generating human-readable explanations of this often complex information.

Another interesting use is in legal and compliance research, as it allows lawyers to query extensive legal documents and regulations and receive concise, easy-to-consume summaries. In summary, RAG extends LLM capabilities, ensuring responses are both relevant and up to date.

Running LLMs and RAG on a Local Computer
Traditionally, LLMs like GPT-4 have required significant computational resources, often only accessible via cloud services due to their size; however, recent techniques are enabling smaller models to run locally.

I hope that my Raspberry Pi 5 will let even hobbyists and developers experiment with LLMs and RAG without relying on cloud AI services.

I know that I will not be able to run a full-scale GPT-4-class model on my Pi 5, but there are smaller open models, such as LLaMA, MPT-7B, and GPT-NeoX, that have been run locally on less powerful computers.

RAG Implementation
RAG requires two components: an LLM and a vector database for storing and retrieving information. I hope to use my Raspberry Pi 5 to implement RAG using various free and open-source tools.

The heart of RAG is the vector database; popular options include Pinecone, Chroma, and Weaviate. A vector database stores your local knowledge base as numerical vectors produced by an embedding model. When a query is received, the vector database retrieves the most relevant documents, and the LLM generates a response based on them.
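At its simplest, a vector database is a store of (vector, document) pairs plus a nearest-neighbor search. The tiny class below is a stand-in for what Chroma or Weaviate do at scale; the 3-dimensional "embeddings" are made up for illustration, whereas real embedding models produce hundreds of dimensions.

```python
import math

# A tiny in-memory stand-in for a vector database such as Chroma or
# Weaviate: store (vector, document) pairs, return the nearest documents.
class MiniVectorStore:
    def __init__(self):
        self.items = []  # list of (vector, document) pairs

    def add(self, vector, document):
        self.items.append((vector, document))

    def query(self, vector, n_results=1):
        def distance(item):
            return math.dist(vector, item[0])  # Euclidean distance
        return [doc for _, doc in sorted(self.items, key=distance)[:n_results]]

store = MiniVectorStore()
# Hypothetical 3-d embeddings; a real model would generate these.
store.add([0.9, 0.1, 0.0], "Pi 5 hardware notes")
store.add([0.1, 0.8, 0.1], "RAG overview")
store.add([0.0, 0.2, 0.9], "Token pricing FAQ")

print(store.query([0.8, 0.2, 0.1]))  # ['Pi 5 hardware notes']
```

Production vector databases add approximate-nearest-neighbor indexes so this lookup stays fast across millions of documents, which is the part that is hard to replicate in a few lines.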


Using these tools, I hope to set up a personal knowledge assistant on my Pi that can answer questions based on my own documents.

Why Use LLMs and RAG Locally?
I want to run an LLM and RAG locally to learn more about running AI locally on a small, inexpensive system. But in the real world, there are several reasons to run these systems locally rather than rely on cloud-based AI.

Privacy is the primary advantage of running an AI system locally, as sensitive data never leaves your device, making it ideal for personal or confidential projects.

Running models locally is also used when companies need offline access to AI, which is helpful in remote areas or in secure, air-gapped environments.

Additionally, it can be cost-effective, helping you avoid recurring cloud usage fees, especially for frequent or heavy use.

Local deployment enables customization, letting you tailor the LLM and RAG pipeline to your specific needs, such as personal documents, home automation tasks, or niche datasets.


Challenges and Considerations
While I think that this will be possible, running LLMs and RAG locally on a Pi 5 comes with some challenges. I believe performance limitations will be a major factor, as the Pi 5's CPU and RAM are limited compared to the cloud GPUs that major AI companies use.

Storage may also be an issue, as models and document embeddings require disk space. Hopefully, the Pi 500+'s NVMe drive will have enough space and performance to handle the load.

Additionally, energy consumption is a consideration, as running large computations continuously may cause my Pi to heat up and slow down. Despite these limitations, I think that using a lightweight LLM will make it feasible.


Final Thoughts
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) are transforming how we interact with computers, enabling more intelligent, context-aware, and versatile applications. While these technologies were once confined to large cloud infrastructures, advancements in model optimization and hardware have made them accessible even on small local computers like the Raspberry Pi 5.

In my next article, I will try to install and run an LLM on my Pi 500+.
