How-To
Running AI on a Raspberry Pi, Part 1: Overview
In a recent
article, I looked at the Raspberry Pi 500+ and found it more than
adequate for desktop use. This got me thinking about whether, with
its quad-core BCM2712 (2.4GHz A76) processor, 16GB of RAM, and, maybe
most importantly, a 256GB NVMe drive, I could use it to run and
learn more about AI. A quick Google search showed that others
have used it for this purpose and that a few AI projects have used it
as a platform. Since the Pi 500+ is based on the more popular Pi 5, I
will use the two names interchangeably in this article.
That people run AI locally on such small hardware surprised me, as I
usually think of AI projects running on powerful computers in massive,
power-hungry data centers. After a little more research, I found that
edge AI, which runs models on smaller, low-powered systems, is already
in use and growing in popularity.
Before diving
into my attempts to run AI on my Pi, I want to use this article to
set the stage on the terms and technologies used in AI.
Overview of LLMs and RAG on a Pi
The heart of
AI is its use of Large Language Models (LLMs). LLMs, such as
OpenAI's GPT, Google's PaLM, Qwen, and others, have
revolutionized how computers understand and generate human language
and, more importantly, how we do our work.
When paired
with techniques like Retrieval-Augmented Generation (RAG), these
LLMs are even more powerful, enabling the delivery of context-aware,
accurate, and up-to-date information.
Recent
advancements in hardware and model optimization have enabled
running smaller versions of these technologies even on local devices
such as a Raspberry Pi 5.
In the rest of
this article, I will dig into what LLMs and RAG are, what they are
used for, and how they can be utilized on small local computers.
Understanding LLMs
At their core,
LLMs are artificial intelligence systems trained on massive text
datasets to understand, generate, and manipulate human language.
These models are built to enable them to understand context, word
relationships, and subtle language nuances. Unlike traditional
programs that follow explicit instructions, LLMs learn patterns from
data, allowing them to (hopefully) generate coherent, contextually
relevant answers to our questions or prompts.
Uses of LLMs
LLMs are
actually quite versatile and can be, and are, used for many
different purposes. They are excellent at Natural Language
Understanding (NLU), enabling them to comprehend human input in both
text and, increasingly, speech. This makes them effective
for chatbots, virtual assistants, and automated customer service
solutions.
But this is
really just the tip of the iceberg of what they can do, as LLMs are
now also being used to produce entire articles, code (see my article
on Vibe coding), summaries, and even graphics (see examples sprinkled
throughout this article).
They can be, and are,
used to translate languages and condense large documents into more
readable forms.
Finally, in
education, they are used to serve as personal tutors, generate
practice questions, and summarize complex papers. Google's NotebookLM
and its open-source clones, Open Notebook and SurfSense, are great
examples of AI tools that can do this.
Tokens
Another term you
may have heard in AI is 'tokens'. LLMs don't read words as humans do;
instead, they process text in chunks called tokens. Interestingly, a
token isn't always a whole word: it can be a single
character, a punctuation mark, or a common sub-word (like "ing"
or "pre"). The first step in an LLM pipeline is tokenization, which
converts raw text into unique numerical IDs. Once converted to
numbers, these tokens are turned into embeddings (also known as
vectors) so the model can perform the complex math needed to predict
the next token in a sequence.
You may see a
"context limit" (e.g., 128k tokens), which refers to how
many of these building blocks the model can "keep in its head"
at once.
Keep in mind that many AI services charge based on the number of
tokens they process. Also, how quickly a computer and a
model can process tokens is a good gauge of their suitability for a
specific purpose.
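As a rough sketch of how tokenization works (using a tiny, made-up vocabulary, not any real model's tokenizer), a greedy sub-word tokenizer might look like this:

```python
# Toy tokenizer: maps sub-word strings to numeric IDs.
# The vocabulary below is invented for illustration; real LLM
# tokenizers (e.g., BPE-based ones) learn theirs from huge corpora.
VOCAB = {"pre": 0, "process": 1, "ing": 2, " ": 3, "text": 4}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no token for text at position {i}")
    return ids

# One word can become several tokens: "pre" + "process" + "ing"
print(tokenize("preprocessing text"))  # → [0, 1, 2, 3, 4]
```

Note that a single word like "preprocessing" becomes three tokens, which is why a document's token count is usually higher than its word count.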
RAG: Retrieval-Augmented Generation
While LLMs are
the backbone of AI, they do have limitations. One of the main issues
is that their knowledge is fixed at the time of their creation
(i.e., during training). This means an LLM trained on data up to 2024
won't know about events or publications that occurred after that.
This is where Retrieval-Augmented Generation (RAG) comes into play.
With RAG, we are no longer limited by a model's training cutoff,
because RAG lets the model retrieve and use external data sources.
The process
generally involves the LLM formulating a query from the user input,
retrieving relevant documents or data from a knowledge base, and
generating a response using the retrieved information in combination
with the LLM.
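The flow above can be sketched in a few lines of Python. The "knowledge base" and relevance scoring here are deliberately crude stand-ins (a word-overlap count instead of a real embedding model), just to show the retrieve-then-augment shape of the pipeline:

```python
# Minimal RAG sketch: retrieve the most relevant document, then
# build an augmented prompt to hand to the LLM.
KNOWLEDGE_BASE = [
    "The Raspberry Pi 5 uses a quad-core BCM2712 processor.",
    "RAG combines retrieval with text generation.",
]

def score(query: str, doc: str) -> int:
    """Crude relevance score: count words shared between query and doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str) -> str:
    """Return the best-matching document from the knowledge base."""
    return max(KNOWLEDGE_BASE, key=lambda doc: score(query, doc))

def build_prompt(query: str) -> str:
    """Combine the retrieved context with the user's question."""
    context = retrieve(query)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

print(build_prompt("What processor does the Raspberry Pi 5 use?"))
```

The final string is what actually gets sent to the LLM: the retrieved facts travel inside the prompt, which is how RAG grounds the model's answer.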
Applications for RAG
RAG is used where
current or domain-specific information is essential. For example, in
customer support, it can provide answers by retrieving data from a
company's product manuals or knowledge base. We are even seeing it
in the medical field, where doctors and patients can query
medical databases for specific information, with the LLM generating
human-readable explanations of this often complex information.
Another
interesting use is in legal and compliance research, as it
allows lawyers to query extensive legal documents and regulations and
receive concise, easy-to-consume summaries. In summary, RAG extends
LLM capabilities, ensuring responses are both relevant and up to
date.
Running LLMs and RAG on a Local Computer
Traditionally,
LLMs like GPT-4 have required significant computational resources,
often only accessible via cloud services due to their size; however,
recent techniques are enabling smaller models to run locally.
I hope that my
Raspberry Pi 5 will let even hobbyists and developers experiment with
LLMs and RAG without relying on cloud AI services.
I know that I
will not be able to run a full-scale GPT-4 model on my Pi 5, but
there are smaller open models, such as LLaMA, MPT-7B, and GPT-NeoX,
that have been run locally on less powerful computers.
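As a preview of what getting a small model running might look like, here is an illustrative sketch using the Ollama runtime. I have not yet tried this on the Pi, and the specific model tag is just an example of a compact model; part 2 will cover what actually works:

```shell
# Illustrative setup commands, not yet tested on the Pi.
# Install the Ollama runtime via its official install script:
curl -fsSL https://ollama.com/install.sh | sh

# Pull a compact (~1B-parameter) model and ask it a question;
# a model this small should fit comfortably in the Pi's 16GB of RAM.
ollama run llama3.2:1b "What is a Raspberry Pi?"
```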
RAG Implementation
RAG requires
two components: an LLM and a vector database for storing and
retrieving information. I hope to use my Raspberry Pi 5 to implement
RAG using various free and open-source tools.
The heart of
RAG is the vector database; the most popular options include
Pinecone, Chroma, and Weaviate. A vector database stores your local
documents after an embedding model converts them into numerical
vectors. In practice, when a query is received, the vector database
retrieves the most relevant documents, and the LLM generates a
response based on them.
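To make the vector idea concrete, here is a toy sketch. The three-dimensional "embeddings" are invented by hand; a real embedding model, like those used behind Pinecone, Chroma, or Weaviate, produces vectors with hundreds of dimensions, but the nearest-neighbor lookup works the same way:

```python
import math

# Toy vector store: each document title is paired with a hand-made
# embedding. Real systems compute embeddings with a trained model.
STORE = [
    ("Pi 5 hardware specs",     [0.9, 0.1, 0.0]),
    ("Vector database basics",  [0.1, 0.8, 0.2]),
    ("Thermal throttling tips", [0.0, 0.2, 0.9]),
]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def nearest(query_vec: list[float]) -> str:
    """Return the document whose embedding is closest to the query's."""
    return max(STORE, key=lambda item: cosine(query_vec, item[1]))[0]

# A query embedded near the "hardware" direction retrieves that doc.
print(nearest([0.8, 0.2, 0.1]))  # → Pi 5 hardware specs
```

This is all a vector database does at its core: it just does it efficiently over millions of high-dimensional vectors instead of three toy ones.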
Using these
tools, I hope to set up a personal knowledge assistant on my Pi that
can answer questions based on my own documents.
Why Use LLMs and RAG Locally?
I want to run an
LLM and RAG locally to learn more about running AI locally on a
small, inexpensive system. But in the real world, there are several
reasons to run these systems locally rather than rely on cloud-based
AI.
Privacy is the
primary advantage of running an AI system locally, as sensitive
data never leaves your device, making it ideal for personal or
confidential projects.
Running models
locally also makes sense when companies need offline access to AI,
which is helpful in remote areas or in secure, air-gapped
environments.
Additionally, it
can be cost-effective, helping you avoid recurring cloud usage
fees, especially for frequent or heavy use.
Local deployment
enables customization, letting you tailor the LLM and RAG pipeline to
your specific needs, such as personal documents, home automation
tasks, or niche datasets.
Challenges and Considerations
While I think
that this will be possible, running LLMs and RAG locally on a Pi 5
comes with some challenges. I believe performance limitations
will be a major factor, as the Pi 5's CPU and RAM are still limited
compared to the cloud GPUs that major AI companies use.
Storage
requirements may also be an issue, as document embeddings and models
require disk space. Hopefully, the 500+'s NVMe drive will have enough
space and performance to handle the load.
Additionally,
energy consumption is a consideration, as running large computations
continuously may cause my Pi to heat up and slow down. Despite
these limitations, I think that using a lightweight LLM will make it
feasible.
Final Thoughts
Large Language
Models (LLMs) and Retrieval-Augmented Generation (RAG) are
transforming how we interact with computers, enabling more
intelligent, context-aware, and versatile applications. While these
technologies were once confined to large cloud infrastructures,
advancements in model optimization and hardware have made them
accessible even on small local computers like the Raspberry Pi 5.
In my next
article, I will try to install and run an LLM on my Pi 500+.