How-To
Running AI on a Raspberry Pi, Part 2: Running AI on a Pi in Under 5 minutes
In a recent article, I discussed running AI locally on a relatively low-powered system, specifically a Raspberry Pi 500+. That article covered the major software components of a local AI stack, including LLMs and RAGs, what they are, and how they are used. I also explained why I think the Pi, with its ARM processor, 16 GB of RAM, and NVMe drive, can handle the load that AI requires. Since the Pi 500+ is based on the more popular Pi 5, I will use the two model names interchangeably in this article, as I did in my past article.
In this article, I will run a local Large Language Model (LLM) on a Raspberry Pi. This was a great way for me to dip my toe into AI, and the best part was that I could do it in less than 5 minutes!
Which LLM on a Pi?
Recent advances in model architecture and aggressive quantization have made it possible to run AI models on extremely small devices, such as the Raspberry Pi. People have reported that LLMs in the 1-4 billion parameter range can now deliver impressive performance on tasks such as text generation, reasoning, coding, tool calling, and even vision understanding, all without requiring GPUs, cloud resources, or heavy infrastructure.
Using tools like
Ollama and quantized model formats, it is now practical to
experiment with local AI on low-power hardware, opening the door
to affordable, private, and portable AI deployments that fit
literally in the palm of your hand (Pi 5), or in my case, the inside
of a keyboard (Pi 500+).
Before starting
this project, I reviewed many popular tiny models, including several
from the Qwen3 family, EXAONE 4.0, Ministral 3, Jamba Reasoning,
IBM's Granite Micro, and Microsoft's Phi-4 Mini. Each had
different strengths in areas such as long-context processing,
reasoning, multimodal understanding, and agentic capabilities. In the
end, I narrowed it down to three small LLMs to work with.
Ollama
Ollama is an
open-source AI platform that lets you run LLMs locally, without
relying on cloud-hosted AI services. It allows you to download,
manage, and run models such as Llama, Mistral, Gemma, and others
directly on your own machines, whether they're laptops, desktops,
servers, or even Pi systems.
Ollama
abstracts away much of the complexity involved in model setup,
dependency management, and hardware acceleration. It provides a clean
command-line interface and API that developers can easily integrate
into their workflows. Ollama is gaining widespread adoption among
developers, researchers, IT professionals, and organizations
experimenting with private AI deployments, local RAG pipelines, and
edge-based inference.
Ollama emerged in
response to a movement toward greater access to AI and the
decentralization of inference workloads. It really started to gain
traction in 2023 and accelerated rapidly through 2024 and 2025 as
local model performance improved dramatically. As quantization techniques, optimized runtimes, and efficient architectures made it possible to run capable models on consumer-grade hardware, Ollama
quickly positioned itself as an easy way for those with limited
resources to get started with AI. Its rapid adoption has been
fueled by its simplicity, active open-source community, and growing
ecosystem of supported models and integrations.
Ollama Commands
Here are a few of
the commands I used with Ollama. These examples use llama3 as the
LLM, but you can use other LLMs.
Command | Example | What It Does
ollama pull | ollama pull llama3 | Downloads a model from the Ollama registry and stores it locally for offline use.
ollama run | ollama run llama3 | Launches an interactive chat session with a model, downloading it first if needed.
ollama list | ollama list | Lists all models currently installed on the local system.
ollama rm | ollama rm llama3 | Deletes a locally stored model to free disk space.
ollama show | ollama show llama3 | Displays detailed information about a model, including parameters and configuration.
ollama serve | ollama serve | Starts the local Ollama API server for programmatic access and integrations.
ollama ps | ollama ps | Shows currently running models and active inference processes.
ollama stop | ollama stop llama3 | Stops a currently running model session.
ollama create | ollama create my-model -f Modelfile | Builds a custom model from a Modelfile configuration.
ollama cp | ollama cp llama3 my-model | Copies an existing model locally, often used as a base for customization.
Tom's Tip: Use /bye to exit ollama.
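Beyond the interactive CLI, ollama serve exposes a local REST API. Here is a minimal sketch of querying it, assuming the default port 11434 and that the llama3 model has already been pulled; the prompt text is just an example.

```shell
# Build a JSON payload for Ollama's local REST API ("stream": false
# returns the whole answer at once instead of token by token).
PAYLOAD='{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'

# POST it to the server started by "ollama serve" (default port 11434).
# Guarded so the script still succeeds on machines without Ollama.
if command -v ollama >/dev/null 2>&1; then
  curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
fi
```

The same endpoint is what integrations and local RAG pipelines typically call, which is why ollama serve matters beyond interactive chat.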
Installing a Local LLM on a Raspberry Pi
Installing and
running an LLM on Raspberry Pi is relatively simple using Ollama.
Deciding which LLM to run is more difficult. The "religious" wars over LLM selection are intense, making past IT wars, such as Windows vs. Mac or file vs. block storage, seem minor by comparison. After far too much research, I decided to start with qwen2.5.
Qwen is an interesting LLM that made quite a ruckus when it was first released, as it comes from the Chinese company Alibaba. The Qwen family of LLMs was designed to deliver strong performance across reasoning, coding, multilingual understanding, and general-purpose text generation.
Building on
earlier releases, Qwen2.5 introduces improvements to model
architecture, training data quality, and alignment techniques,
resulting in better instruction-following, more accurate reasoning,
and stronger performance on benchmarks for math, code, and knowledge
tasks. The model comes in multiple sizes, making it suitable for both large-scale cloud deployments and smaller, local or edge-based inference scenarios, as I will be doing. It offers
robust multilingual capabilities with strong support for English,
Chinese, and many other languages. It has open and enterprise
licensing options. Qwen2.5 has quickly become a popular choice for
developers building chatbots, AI assistants, coding tools, and
retrieval-augmented generation (RAG) systems.
Installing Ollama
I found Ollama
very easy to install. Below are the steps I took to install it.
Prepare the
Raspberry Pi
Ensure your Raspberry Pi is running the latest version of Raspberry Pi OS. You can do this from its GUI or the command line. I updated it from the command line by entering

sudo apt update
sudo apt full-upgrade -y
Install
Ollama
I initially tried to install Ollama via the GUI on the Pi desktop,
but it wasn't available.
To install it from the command line, I ran the following command, which downloads Ollama's install.sh script and pipes it to sh to run it directly in the terminal:
curl -fsSL https://ollama.com/install.sh | sh
To verify
that it was installed and to see what version was installed, I
entered
ollama --version
After installing Ollama, I installed the qwen2.5 LLM by entering

ollama run qwen2.5
It took a few minutes to pull down the LLM and start it.
I stopped
the LLM by entering
/bye
I verified it was still loaded in memory by going to another terminal window and entering

ollama ps
Tom's tip: Ollama keeps loaded models in real memory (RAM) for five minutes by default when they are idle. You can see how long a model will remain in memory under the UNTIL column by running ollama ps. To immediately unload a model from memory, use the ollama stop command: ollama stop <model name>
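If the default idle timeout is too short for your workflow, it can be adjusted. Here is a sketch, assuming your Ollama version honors the OLLAMA_KEEP_ALIVE environment variable; the 30-minute value is just an example.

```shell
# Keep idle models in RAM for 30 minutes instead of the default.
# "-1" keeps a model loaded indefinitely; "0" unloads it immediately.
export OLLAMA_KEEP_ALIVE=30m

# This affects a server started from this shell, e.g. "ollama serve".
# For the systemd service that install.sh sets up, add the variable
# via "sudo systemctl edit ollama" (an Environment= line) instead.
```

Keeping a model resident avoids the load delay on the next prompt, at the cost of holding several gigabytes of RAM.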
I checked which models were installed by entering
ollama list
It showed that qwen2.5:latest was installed.
Running an LLM
For the LLM's
first test, I gave it a relatively easy task: create the HTML code to
display "Hello World" in blue.
To do this, I started the model again by entering

ollama run qwen2.5:latest
I then gave it my first query using the following prompt:

Create the HTML code to display "Hello World" in blue
I opened up
another terminal and monitored the system's hardware performance
using the top command. It showed that the CPU was at 100% on all
cores.
It also showed that 99.8 percent of CPU time was spent running user code (us), that the ollama process had a resident memory size of 4.7 GB, and that the system had plenty of free RAM.
Using htop, I
noticed very little disk activity. This was probably due to the LLM
being loaded in RAM.
The system
temperature was 54-60 degrees Celsius while the process was running.
The system used
between 8 and 9 watts of power while running the LLM.
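The monitoring I did in a second terminal can be wrapped in a short script. Here is a sketch: top is standard Linux, while vcgencmd (used here for the temperature reading) is Raspberry Pi-specific, so it is guarded.

```shell
# One batch-mode snapshot of CPU and memory usage (no interactive UI).
SNAPSHOT=$(top -b -n 1 | head -15)
echo "$SNAPSHOT"

# SoC temperature, Raspberry Pi only (prints e.g. temp=54.0'C).
# Guarded so the script still runs on non-Pi machines.
if command -v vcgencmd >/dev/null 2>&1; then
  vcgencmd measure_temp
fi
```

Running this in a loop with watch or sleep gives a rough picture of load and thermals over the life of a query.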
After a few
minutes, information started to slowly appear on the screen, just a
few characters at a time.
Finally, the code
fully appeared on the screen.
In the end, it
took 23 minutes to run and consumed 41 tokens.
Tom's tip: Appending --verbose to the end of a run command
will display the query statistics.
The code looked
fine, but the time it took to generate it was insufferably long. It
then occurred to me that I was running the latest model, which was
not optimized for small systems, and was 4.6 GB in size.
I looked up the various Qwen models available.
I needed a
smaller model, so I switched to 3b, which was less than half
the size.
I stopped and removed the large model by entering

ollama stop qwen2.5:latest
ollama rm qwen2.5:latest

and then downloaded and ran the smaller LLM by entering

ollama run qwen2.5:3b
The query ran faster (17 minutes vs. 23) yet produced the same code.
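To compare model sizes more systematically, the same prompt can be timed non-interactively: ollama run accepts the prompt as an argument and exits when the answer is complete. A sketch, assuming both models are still pulled; the output file names are arbitrary.

```shell
# Time the same one-shot prompt against the two model sizes.
# Answers go to files so only the timing reaches the screen.
PROMPT='Create the HTML code to display "Hello World" in blue'

# Guarded so the script still succeeds on machines without Ollama.
if command -v ollama >/dev/null 2>&1; then
  time ollama run qwen2.5:latest "$PROMPT" > answer-latest.html
  time ollama run qwen2.5:3b     "$PROMPT" > answer-3b.html
fi
```

Appending --verbose to each run command adds token counts and tokens-per-second figures to the comparison.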
Final Thoughts
My experience
running an LLM on a Raspberry Pi was interesting. On the one hand,
using Ollama made downloading, installing, and querying the LLM
dead simple, and I was able to do it in less than 5 minutes. I
have never had such an easy time installing such a powerful piece of
software. However, the LLM's performance was lacking; waiting 17
minutes for a response to a simple question is unacceptable, as it is
literally 3 times longer than it took me to install the LLM.
I hope the slow
response time was simply a factor of the LLM I chose and not an
indicator of running AI on a Pi. In my next article, I will test
other LLMs on the Pi to see if the performance is more acceptable.