How-To
Running AI on a Raspberry Pi, Part 2: Running AI on a Pi in Under 5 minutes
In a recent article, I discussed running AI locally on a relatively low-powered system, specifically a Raspberry Pi 500+. That article covered the major software components of a local AI stack, including LLMs and RAGs, what they are, and how they are used. I also explained why I think the Pi, with its ARM processor, 16 GB of RAM, and NVMe drive, can handle the load that AI requires. Since the Pi 500+ is based on the more popular Pi 5, I will use the two model names interchangeably in this article, as I did in my past article.
In this article, I will run a local Large Language Model (LLM) on a Raspberry Pi. This was a great way for me to dip my toe into AI, and the best part was that I could do it in less than 5 minutes!
Which LLM on a Pi?
Recent advances in model architecture and aggressive quantization have made it possible to run AI models on extremely small devices, such as the Raspberry Pi. People have reported that LLMs in the 1-4 billion parameter range can now deliver impressive performance on tasks such as text generation, reasoning, coding, tool calling, and even vision understanding, all without requiring GPUs, cloud resources, or heavy infrastructure.
Using tools like
Ollama and quantized model formats, it is now practical to
experiment with local AI on low-power hardware, opening the door
to affordable, private, and portable AI deployments that fit
literally in the palm of your hand (Pi 5), or in my case, the inside
of a keyboard (Pi 500+).
Before starting
this project, I reviewed many popular tiny models, including several
from the Qwen3 family, EXAONE 4.0, Ministral 3, Jamba Reasoning,
IBM's Granite Micro, and Microsoft's Phi-4 Mini. Each had
different strengths in areas such as long-context processing,
reasoning, multimodal understanding, and agentic capabilities. In the
end, I narrowed it down to three small LLMs to work with.
Ollama
Ollama is an
open-source AI platform that lets you run LLMs locally, without
relying on cloud-hosted AI services. It allows you to download,
manage, and run models such as Llama, Mistral, Gemma, and others
directly on your own machines, whether they're laptops, desktops,
servers, or even Pi systems.
Ollama
abstracts away much of the complexity involved in model setup,
dependency management, and hardware acceleration. It provides a clean
command-line interface and API that developers can easily integrate
into their workflows. Ollama is gaining widespread adoption among
developers, researchers, IT professionals, and organizations
experimenting with private AI deployments, local RAG pipelines, and
edge-based inference.
Ollama emerged in
response to a movement toward greater access to AI and the
decentralization of inference workloads. It really started to gain
traction in 2023 and accelerated rapidly through 2024 and 2025 as
local model performance improved dramatically. As quantization techniques, optimized runtimes, and efficient architectures made it possible to run capable models on consumer-grade hardware, Ollama
quickly positioned itself as an easy way for those with limited
resources to get started with AI. Its rapid adoption has been
fueled by its simplicity, active open-source community, and growing
ecosystem of supported models and integrations.
Ollama Commands
Here are a few of
the commands I used with Ollama. These examples use llama3 as the
LLM, but you can use other LLMs.
Command | Example | What It Does
ollama pull | ollama pull llama3 | Downloads a model from the Ollama registry and stores it locally for offline use.
ollama run | ollama run llama3 | Launches an interactive chat session with a model, downloading it first if needed.
ollama list | ollama list | Lists all models currently installed on the local system.
ollama rm | ollama rm llama3 | Deletes a locally stored model to free disk space.
ollama show | ollama show llama3 | Displays detailed information about a model, including parameters and configuration.
ollama serve | ollama serve | Starts the local Ollama API server for programmatic access and integrations.
ollama ps | ollama ps | Shows currently running models and active inference processes.
ollama stop | ollama stop llama3 | Stops a currently running model session.
ollama create | ollama create my-model -f Modelfile | Builds a custom model from a Modelfile configuration.
ollama cp | ollama cp llama3 my-model | Copies an existing model locally, often used as a base for customization.
Tom's Tip: Use /bye to exit ollama.
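Beyond the interactive CLI, ollama serve exposes a local REST API. Here is a minimal sketch of querying it, assuming the default port 11434 and that the llama3 model has already been pulled; the prompt text is just an example.

```shell
# Build a JSON payload for Ollama's local REST API ("stream": false
# returns the whole answer at once instead of token by token).
PAYLOAD='{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'

# POST it to the server started by "ollama serve" (default port 11434).
# Guarded so the script still succeeds on machines without Ollama.
if command -v ollama >/dev/null 2>&1; then
  curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
fi
```

The same endpoint is what integrations and local RAG pipelines typically call, which is why ollama serve matters beyond interactive chat.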
Installing a Local LLM on a Raspberry Pi
Installing and
running an LLM on Raspberry Pi is relatively simple using Ollama.
Deciding which LLM to run is more difficult. The "religious" wars over LLM selection are intense, making past IT wars, such as Windows vs. Mac or file vs. block storage, seem minor by comparison. After far too much research, I decided to start with qwen2.5.
Qwen is an interesting LLM that made quite a ruckus when it was first released, as it comes from the Chinese company Alibaba. The Qwen family of LLMs was designed to deliver strong performance across reasoning, coding, multilingual understanding, and general-purpose text generation.
Building on
earlier releases, Qwen2.5 introduces improvements to model
architecture, training data quality, and alignment techniques,
resulting in better instruction-following, more accurate reasoning,
and stronger performance on benchmarks for math, code, and knowledge
tasks. The model comes in multiple sizes, making it suitable for both large-scale cloud deployments and smaller, local or edge-based inference scenarios, as I will be doing. It offers
robust multilingual capabilities with strong support for English,
Chinese, and many other languages. It has open and enterprise
licensing options. Qwen2.5 has quickly become a popular choice for
developers building chatbots, AI assistants, coding tools, and
retrieval-augmented generation (RAG) systems.
Installing Ollama
I found Ollama
very easy to install. Below are the steps I took to install it.
Prepare the
Raspberry Pi
Ensure your Raspberry Pi is running the latest version of Raspberry Pi OS. You can do this from its GUI or the command line. I updated it from the command line by entering

sudo apt update
sudo apt full-upgrade -y
Install
Ollama
I initially tried to install Ollama via the GUI on the Pi desktop,
but it wasn't available.
To install it from the command line, I ran the following command, which downloads Ollama's install.sh script and pipes it to sh to run it directly in the terminal:
curl -fsSL https://ollama.com/install.sh | sh
To verify
that it was installed and to see what version was installed, I
entered
ollama --version
After installing Ollama, I installed the qwen2.5 LLM by entering

ollama run qwen2.5
It took a few minutes to pull down the LLM and start it.
I stopped
the LLM by entering
/bye
I verified it was still loaded in memory by going to another terminal window and entering

ollama ps
Tom's tip: Ollama keeps loaded models in real memory (RAM) for five minutes by default when they are idle. You can see how long a model will remain in memory under the UNTIL column by running ollama ps. To immediately unload a model from memory, use the ollama stop command: ollama stop <model name>
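If the default idle timeout is too short for your workflow, it can be adjusted. Here is a sketch, assuming your Ollama version honors the OLLAMA_KEEP_ALIVE environment variable; the 30-minute value is just an example.

```shell
# Keep idle models in RAM for 30 minutes instead of the default.
# "-1" keeps a model loaded indefinitely; "0" unloads it immediately.
export OLLAMA_KEEP_ALIVE=30m

# This affects a server started from this shell, e.g. "ollama serve".
# For the systemd service that install.sh sets up, add the variable
# via "sudo systemctl edit ollama" (an Environment= line) instead.
```

Keeping a model resident avoids the load delay on the next prompt, at the cost of holding several gigabytes of RAM.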
I checked which models were installed by entering
ollama list
It showed that qwen2.5:latest was installed.
Running an LLM
For the LLM's
first test, I gave it a relatively easy task: create the HTML code to
display "Hello World" in blue.
To do this, I started the model again by entering

ollama run qwen2.5:latest
I then gave it my first query using the following prompt:

Create the HTML code to display "Hello World" in blue
I opened up
another terminal and monitored the system's hardware performance
using the top command. It showed that the CPU was at 100% on all
cores.
It also showed that 99.8 percent of CPU time was spent running user code (us), that the ollama process had a resident memory size of 4.7 GB, and that the system had plenty of free RAM.
Using htop, I
noticed very little disk activity. This was probably due to the LLM
being loaded in RAM.
The system
temperature was 54-60 degrees Celsius while the process was running.
The system used
between 8 and 9 watts of power while running the LLM.
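The monitoring I did in a second terminal can be wrapped in a short script. Here is a sketch: top is standard Linux, while vcgencmd (used here for the temperature reading) is Raspberry Pi-specific, so it is guarded.

```shell
# One batch-mode snapshot of CPU and memory usage (no interactive UI).
SNAPSHOT=$(top -b -n 1 | head -15)
echo "$SNAPSHOT"

# SoC temperature, Raspberry Pi only (prints e.g. temp=54.0'C).
# Guarded so the script still runs on non-Pi machines.
if command -v vcgencmd >/dev/null 2>&1; then
  vcgencmd measure_temp
fi
```

Running this in a loop with watch or sleep gives a rough picture of load and thermals over the life of a query.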
After a few
minutes, information started to slowly appear on the screen, just a
few characters at a time.
Finally, the code
fully appeared on the screen.
In the end, it
took 23 minutes to run and consumed 41 tokens.
Tom's tip: Appending --verbose to the end of a run command
will display the query statistics.
The code looked
fine, but the time it took to generate it was insufferably long. It
then occurred to me that I was running the latest model, which was
not optimized for small systems, and was 4.6 GB in size.
I looked up the various Qwen models available.
I needed a
smaller model, so I switched to 3b, which was less than half
the size.
I stopped and removed the large model by entering

ollama stop qwen2.5:latest
ollama rm qwen2.5:latest

and then downloaded and ran the smaller LLM by entering

ollama run qwen2.5:3b
The query ran faster (17 minutes vs. 23) yet produced the same code.
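To compare model sizes more systematically, the same prompt can be timed non-interactively: ollama run accepts the prompt as an argument and exits when the answer is complete. A sketch, assuming both models are still pulled; the output file names are arbitrary.

```shell
# Time the same one-shot prompt against the two model sizes.
# Answers go to files so only the timing reaches the screen.
PROMPT='Create the HTML code to display "Hello World" in blue'

# Guarded so the script still succeeds on machines without Ollama.
if command -v ollama >/dev/null 2>&1; then
  time ollama run qwen2.5:latest "$PROMPT" > answer-latest.html
  time ollama run qwen2.5:3b     "$PROMPT" > answer-3b.html
fi
```

Appending --verbose to each run command adds token counts and tokens-per-second figures to the comparison.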
Final Thoughts
My experience
running an LLM on a Raspberry Pi was interesting. On the one hand,
using Ollama made downloading, installing, and querying the LLM
dead simple, and I was able to do it in less than 5 minutes. I
have never had such an easy time installing such a powerful piece of
software. However, the LLM's performance was lacking; waiting 17
minutes for a response to a simple question is unacceptable, as it is
literally 3 times longer than it took me to install the LLM.
I hope the slow
response time was simply a factor of the LLM I chose and not an
indicator of running AI on a Pi. In my next article, I will test
other LLMs on the Pi to see if the performance is more acceptable.