Get Started with the AI HAT+ 2 on Raspberry Pi 5: A Practical Setup & Project Guide


toolkit
2026-01-21 12:00:00
10 min read

Step-by-step guide to install AI HAT+ 2 on Raspberry Pi 5, benchmark it, and deploy three edge AI projects (voice, microservice, image).

Stop juggling cloud AI and slow edge prototypes — get the AI HAT+ 2 running on Raspberry Pi 5, fast

If you manage tooling for developers or run PoCs for edge AI, you know the pain: architectures that work in the cloud often crumble at the edge, onboarding time is huge, and measuring real-world ROI is a guessing game. The AI HAT+ 2 ($130) for the Raspberry Pi 5 changes that equation — a compact accelerator that brings hardware-accelerated inference and on-device LLM capability to Pi-class devices. This guide walks you from unboxing to production-ready deployments: setup, benchmarks, and three real-world projects (voice assistant, inference microservice, image classifier) with practical commands, deployment patterns, and operational tips tailored for developers and IT admins in 2026.

What’s changed in 2026 (and why it matters)

Edge AI in late 2025–2026 has a few clear trends you need to know:

  • On-device LLMs are practical — quantized LLMs with ggml/llama.cpp-style runtimes and hardware NPU support are now common on Pi-class devices. This reduces latency, lowers bandwidth, and improves privacy for many use cases.
  • Hardware acceleration ecosystems matured — popular runtimes (ONNX Runtime, OpenVINO, ncnn) have production-grade backends for small NPUs and Vision/ML accelerators, making porting models far easier.
  • Edge-first architectures — hybrid patterns (on-device inferencing + cloud only for heavy tasks) are mainstream in enterprise deployments to control costs and improve uptime.
“Measure on the device — not in the lab.” — a rule of thumb for edge deployment teams in 2026.

Before you start: hardware, power, and OS checklist

Prepare for predictable deployments with this short checklist.

  1. Raspberry Pi 5 with a 64-bit OS (Raspberry Pi OS 64‑bit or Ubuntu 24.04/26.04 LTS recommended).
  2. AI HAT+ 2 (HAT form factor that mounts to the Pi’s 40‑pin header). Confirm vendor firmware and OS compatibility.
  3. Power supply: a high-quality USB-C supply, such as the official Raspberry Pi 27W (5.1 V / 5 A) unit or better, to cover the Pi 5, the accelerator, and peripherals.
  4. Cooling: active fan or a large heatsink. NPUs and Pi5’s CPU get warm under sustained load.
  5. Storage: NVMe/SSD or fast microSD (32GB+ recommended) for swap, models, and logs.
  6. Network: wired Ethernet preferred for reproducible benchmarks; Wi‑Fi for field deployments.

Step-by-step: Hardware assembly and first boot

1. Mount the AI HAT+ 2

Power off the Pi and attach the AI HAT+ 2 to the 40-pin header. Secure with the included standoffs. If your HAT uses an alternative connector, follow the vendor instructions but ensure the board is grounded and firmly seated.

2. Attach peripherals and power

Connect a keyboard and HDMI for the first-time setup, or use headless SSH if you prefer. Plug in a quality power supply and boot the Pi. Watch the serial/console during the first boot for firmware messages from the HAT — they often show driver or firmware version info.

3. Flash the OS (if needed)

Use Raspberry Pi Imager or Ubuntu’s image tool to write a 64‑bit image, for example Raspberry Pi OS (64-bit). On first boot, enable SSH, set a strong password, and replace any default credentials or keys.

sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential python3-pip docker.io docker-compose

Software stack: Install drivers, runtimes, and tools

Install the AI HAT+ 2 vendor drivers and the collection of runtimes you need for your projects.

1. Vendor drivers & firmware

Follow the vendor-provided instructions. Typical steps:

git clone https://github.com/vendor/ai-hat-plus-2-drivers.git
cd ai-hat-plus-2-drivers
sudo ./install.sh

After installation, reboot and confirm the device is recognized via dmesg, lspci, or lsusb (or the vendor's diagnostic tool).

2. Install common runtime toolchain

Install Python, ONNX Runtime, and container support. The AI HAT+ 2 usually works with ONNX Runtime with an NPU delegate; check vendor docs for the correct pip wheel or shared object.

sudo apt install -y python3-venv python3-dev
python3 -m venv venv && source venv/bin/activate
pip install --upgrade pip
pip install onnxruntime onnx numpy pillow fastapi uvicorn

For on-device LLMs install llama.cpp/ggml or vendor LLM runtime. Example (build llama.cpp):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j
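
If you prefer to drive the model from Python (for example from the voice-assistant project below), the llama-cpp-python bindings (pip install llama-cpp-python) wrap the same runtime. A minimal sketch; the GGUF path is a placeholder for whatever quantized model you download:

from llama_cpp import Llama

# Placeholder model path; any quantized GGUF instruct/chat model will do.
llm = Llama(model_path="models/llm-q4.gguf", n_ctx=2048, n_threads=4)
out = llm("Q: Name three uses for an edge NPU. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"].strip())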

Benchmarks: measure baseline performance and power

Before you optimize, get baseline metrics. Run the vendor benchmark, then a simple ONNX inference benchmark. If you’re interested in low-latency infrastructure patterns and container strategies, see Edge Containers & Low‑Latency Architectures for deeper patterns.

1. Vendor synthetic benchmark

Run the HAT’s built-in benchmark to verify NPU health:

sudo ai-hat-bench --run
# or vendor specific
sudo /opt/ai-hat/bin/benchmark --full

2. Simple ONNX inference benchmark

Use a small model (MobileNet) and measure latency and memory use. Save the script as bench_infer.py and run it.

cat > bench_infer.py <<PY
import onnxruntime as rt
import numpy as np
import time

# Expects mobilenetv2-1.0.onnx in the working directory (e.g. from the ONNX Model Zoo).
sess = rt.InferenceSession('mobilenetv2-1.0.onnx', providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype('float32')

# Warm up once so one-off initialization cost does not skew the average.
sess.run(None, {input_name: x})

N = 100
start = time.time()
for _ in range(N):
    _ = sess.run(None, {input_name: x})
print('avg ms:', (time.time() - start) / N * 1000)
PY
python3 bench_infer.py

Repeat the test with the NPU provider once installed (replace providers list). Record avg latency, peak memory, and CPU usage (top or htop).
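
The exact provider string for the HAT's delegate is vendor-specific ('NPUProvider' below is a placeholder); confirm the real name with onnxruntime.get_available_providers() after installing the vendor wheel. A minimal sketch of the swap with CPU fallback:

import onnxruntime as rt

# 'NPUProvider' is a placeholder; use the provider name registered by the vendor wheel.
preferred = ['NPUProvider', 'CPUExecutionProvider']
available = rt.get_available_providers()
providers = [p for p in preferred if p in available] or ['CPUExecutionProvider']
sess = rt.InferenceSession('mobilenetv2-1.0.onnx', providers=providers)
print('active providers:', sess.get_providers())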

3. Power profiling

Measure current draw with a USB power meter or inline power monitor. Log power during idle and sustained inference to estimate energy per inference.
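
To turn those power logs into a cost-per-query figure, subtract idle power from loaded power and divide by the benchmark's throughput. A tiny sketch with placeholder numbers (substitute your own measurements):

# Marginal energy per inference from measured power and benchmark throughput.
idle_w, load_w = 4.8, 9.6      # placeholder average power (W) at idle and under sustained load
throughput_ips = 85.0          # placeholder inferences/sec from your benchmark run
energy_mj = (load_w - idle_w) / throughput_ips * 1000
print(f"~{energy_mj:.1f} mJ per inference (marginal energy)")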

Project 1 — Local voice assistant (offline-first)

Goal: a low-latency, privacy-preserving assistant that runs fully on the device for most queries and falls back to cloud only when needed.

Architecture

  • Wake-word and VAD: Porcupine, WebRTC VAD, or Vosk
  • Speech-to-text (STT): small quantized model (wav2vec2 or VOSK) with NPU acceleration
  • Command processing: local rule-based intents + on-device LLM (quantized) for open responses
  • Text-to-speech (TTS): Coqui TTS or eSpeak NG (or vendor hardware-accelerated TTS)

Quick setup

pip install sounddevice webrtcvad vosk TTS   # "TTS" is the Coqui TTS package
# Download a small STT model (Vosk)
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip -d models/vosk-small

Run a minimal pipeline

This example shows a loop: listen > VAD > STT > LLM > TTS. Use llama.cpp for the LLM inference step when offline.

python3 voice_assistant.py --model-dir models/vosk-small --llm models/ggml-alpaca-q4.bin
# or run as a Docker container for consistent deployment
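
voice_assistant.py itself is left to you; the sketch below shows one way to wire the listen > VAD > STT stages with sounddevice, webrtcvad, and Vosk, under the assumption that intent handling, the LLM call, and TTS are stubbed behind handle_utterance():

import json, queue, sys
import sounddevice as sd
import webrtcvad
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000
FRAME_SAMPLES = SAMPLE_RATE * 30 // 1000        # 30 ms frames (webrtcvad accepts 10/20/30 ms)

vad = webrtcvad.Vad(2)                          # 0 = permissive, 3 = aggressive
model = Model("models/vosk-small")
audio_q = queue.Queue()

def on_audio(indata, frames, time_info, status):
    if status:
        print(status, file=sys.stderr)
    audio_q.put(bytes(indata))

def handle_utterance(text):
    # Stub: match local intents first, then fall back to an on-device LLM for open queries.
    print("heard:", text)

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=FRAME_SAMPLES,
                       dtype="int16", channels=1, callback=on_audio):
    print("Listening (Ctrl+C to stop)...")
    speech, silent_frames = bytearray(), 0
    while True:
        frame = audio_q.get()
        if vad.is_speech(frame, SAMPLE_RATE):
            speech.extend(frame)
            silent_frames = 0
        elif speech:
            silent_frames += 1
            if silent_frames > 10:               # ~300 ms of silence ends the utterance
                rec = KaldiRecognizer(model, SAMPLE_RATE)
                rec.AcceptWaveform(bytes(speech))
                text = json.loads(rec.FinalResult()).get("text", "")
                if text:
                    handle_utterance(text)
                speech, silent_frames = bytearray(), 0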

Operational tips

  • Use VAD aggressively to minimize wasted STT cycles.
  • Hybrid strategy: run intent-based commands locally and only call cloud LLMs for complex queries (a routing sketch follows this list).
  • Quantized LLMs are crucial: 4‑bit or 8‑bit quant models reduce memory and speed up inference on NPUs.
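
One way to implement that split is a thin router that answers known intents locally and escalates everything else; call_cloud_llm() below is a placeholder for your cloud endpoint (or a heavier on-device model):

import datetime
import re

# Local intent table: regex -> handler. Extend with your own device commands.
LOCAL_INTENTS = {
    r"\bwhat time is it\b": lambda m: datetime.datetime.now().strftime("It is %H:%M."),
    r"\b(turn|switch) (on|off) the light\b": lambda m: f"Turning {m.group(2)} the light.",
}

def call_cloud_llm(query: str) -> str:
    # Placeholder: POST the query to your cloud LLM endpoint (or a larger local model) here.
    return f"[escalated: {query!r}]"

def route(query: str) -> str:
    for pattern, handler in LOCAL_INTENTS.items():
        match = re.search(pattern, query.lower())
        if match:
            return handler(match)
    return call_cloud_llm(query)

print(route("what time is it"))                   # handled locally
print(route("summarize yesterday's sensor log"))  # escalated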

Project 2 — Inference microservice (FastAPI + NPU)

Goal: a reproducible, containerized inference microservice that exposes a REST API for image and text inference, suitable for fleet deployment.

Why this pattern

Containers simplify lifecycle management for IT admins. Use Docker Compose or k3s to orchestrate multiple Pi+HAT nodes. Keep containers minimal and pin versions for reproducibility.

Sample Dockerfile (minimal)

FROM python:3.11-slim
RUN apt-get update && apt-get install -y libsndfile1
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

FastAPI app: main.py

from fastapi import FastAPI, File, UploadFile
import onnxruntime as rt
import numpy as np, io
from PIL import Image
app = FastAPI()
model = rt.InferenceSession('/models/mobilenet.onnx', providers=['NPUProvider','CPUExecutionProvider'])
input_name = model.get_inputs()[0].name
@app.post('/classify')
async def classify(file: UploadFile = File(...)):   # needs python-multipart installed
    # load image -> preprocess (RGB, 224x224, NCHW, 0-1 scale; ImageNet mean/std omitted) -> run
    img = Image.open(io.BytesIO(await file.read())).convert('RGB').resize((224, 224))
    x = np.asarray(img, dtype=np.float32).transpose(2, 0, 1)[None] / 255.0
    scores = model.run(None, {input_name: x})[0]
    return {'class_id': int(np.argmax(scores))}

Deployment tips

  • Use healthchecks and auto-restart in your container runtime for robustness (see the /healthz sketch after this list).
  • Log structured JSON for easy ingestion into central observability (Fluent Bit, Prometheus metrics). For edge observability and governance patterns see our policy-as-code & edge observability playbook.
  • Use connection pooling and limit request size to avoid OOM on the Pi.
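
For the healthcheck bullet above, one lightweight pattern is a /healthz endpoint added to the main.py shown earlier, which an orchestrator or Docker HEALTHCHECK can probe. A sketch (it reuses the app and model objects from main.py):

# Hypothetical addition to main.py: cheap liveness probe that also reports whether the
# NPU delegate is still the active execution provider.
@app.get('/healthz')
def healthz():
    providers = model.get_providers()
    return {'status': 'ok', 'npu_active': providers[0] != 'CPUExecutionProvider'}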

Project 3 — Image classifier for on-device inference

Goal: run an image classifier (MobileNet/EfficientNet) accelerated by the AI HAT+ 2. This is a workhorse workload for many IoT cameras and visual inspection tasks; for small CCTV and camera-first deployments see Hybrid Edge Strategies for Small Business CCTV.

Model preparation

  • Start from a TensorFlow/PyTorch pre-trained model.
  • Convert to ONNX and apply post-training quantization (INT8 or lower) using tools like ONNX Runtime quantization or OpenVINO quantizer.
  • Validate accuracy vs. baseline; expect small degradation when aggressively quantized.

Convert example (PyTorch → ONNX)

python3 convert_to_onnx.py --model mobilenet_v2 --output mobilenetv2.onnx
# then quantize with the ONNX Runtime Python quantization API (quantize_dynamic / quantize_static); see the sketch below
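
convert_to_onnx.py is not shown in this guide; a minimal sketch of what it might do using torchvision and the ONNX Runtime dynamic-quantization API (static INT8 quantization with a calibration dataset generally preserves more accuracy, at the cost of more setup):

# Hypothetical convert_to_onnx.py: export a pretrained torchvision MobileNetV2 to ONNX,
# then apply ONNX Runtime post-training dynamic quantization.
import torch
import torchvision
from onnxruntime.quantization import quantize_dynamic, QuantType

model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1").eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "mobilenetv2.onnx",
                  input_names=["input"], output_names=["logits"], opset_version=17)

# Quantize weights to INT8; always re-validate accuracy against the FP32 baseline.
quantize_dynamic("mobilenetv2.onnx", "mobilenetv2_int8.onnx", weight_type=QuantType.QInt8)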

Inference script and benchmarking

python3 image_infer.py --model mobilenetv2_int8.onnx --images testset/
# measure throughput with the bench script from earlier
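
image_infer.py is likewise yours to write; a minimal sketch of the directory-walking classifier it might implement, reusing the same preprocessing as the microservice above (swap in the vendor NPU provider once it is installed):

# Hypothetical image_infer.py: classify every image in a directory and report throughput.
import argparse, pathlib, time
import numpy as np
import onnxruntime as rt
from PIL import Image

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--images", required=True)
args = parser.parse_args()

sess = rt.InferenceSession(args.model, providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
paths = sorted(p for p in pathlib.Path(args.images).iterdir()
               if p.suffix.lower() in {".jpg", ".jpeg", ".png"})

start = time.time()
for path in paths:
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32).transpose(2, 0, 1)[None] / 255.0
    print(path.name, int(np.argmax(sess.run(None, {input_name: x})[0])))
print(f"{len(paths) / (time.time() - start):.1f} images/sec")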

Edge optimization tips

  • Input resizing on the device reduces compute — crop to ROI if you have camera metadata.
  • Batching tradeoffs: micro-batches (2–8) improve throughput but increase latency.
  • Use mmap for large models where supported to reduce memory pressure.
  • For secure, low-latency image verification pipelines see Edge-First Image Verification.

Advanced strategies and hard-earned operational tips

After prototyping, apply these techniques to move to production.

  • Model versioning & A/B — ship models as artifacts and run A/B traffic to evaluate accuracy and cost before full rollout.
  • Observability — instrument per-inference latency, GPU/NPU utilization, memory, and power; send aggregates to a central dashboard.
  • Graceful fallbacks — detect NPU failures and fallback to CPU provider smoothly to keep the service available.
  • Security — sign models at rest, enable disk encryption, and run containers with least privileges.
  • Swap and zRAM — configure carefully; zRAM improves responsiveness but can hide poor memory planning.
  • Model sharding and cascade — use a small fast model to filter easy cases and a larger model for hard cases to save cycles (a minimal cascade sketch follows this list). See work on Causal ML at the Edge for related pipeline considerations.
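
A minimal sketch of the cascade idea, assuming both models were exported with the same 224x224 input; the model paths and the 0.8 confidence threshold are placeholders to tune on your own data:

import numpy as np
import onnxruntime as rt

small = rt.InferenceSession("mobilenetv2_int8.onnx", providers=["CPUExecutionProvider"])
large = rt.InferenceSession("large_classifier.onnx", providers=["CPUExecutionProvider"])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cascade_classify(x, threshold=0.8):
    # x: preprocessed (1, 3, 224, 224) float32 tensor, as in the earlier scripts.
    name = small.get_inputs()[0].name
    probs = softmax(small.run(None, {name: x})[0][0])
    if probs.max() >= threshold:
        return int(probs.argmax()), "small"   # easy case: trust the fast model
    name = large.get_inputs()[0].name
    probs = softmax(large.run(None, {name: x})[0][0])
    return int(probs.argmax()), "large"       # hard case: escalate to the bigger model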

Benchmarking checklist and what to report

When you benchmark, report these standard metrics so stakeholders can compare configurations (a small helper for the latency figures follows the list):

  • Average and p95 latency (ms)
  • Throughput (inferences/sec)
  • Memory usage (MB) — peak and resident
  • Power draw (W) — idle and under load
  • Model size and quantization level
  • Accuracy delta vs. baseline
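
A small helper for the latency lines above, assuming you collect raw per-inference timings (for example by extending bench_infer.py to record each iteration):

import numpy as np

def summarize(per_inference_seconds):
    # Returns the latency and throughput figures listed above from raw timings in seconds.
    lat_ms = np.asarray(per_inference_seconds) * 1000
    return {
        "avg_ms": round(float(lat_ms.mean()), 2),
        "p95_ms": round(float(np.percentile(lat_ms, 95)), 2),
        "throughput_ips": round(1000.0 / float(lat_ms.mean()), 1),
    }

print(summarize([0.012, 0.011, 0.013, 0.019, 0.012]))  # placeholder timings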

Common troubleshooting

Device not detected

Check dmesg and vendor diagnostics. Re-seat the HAT and reflash vendor firmware if needed.

OOMs during model load

Try quantization, use mmap, or increase swap/zRAM temporarily while you optimize the model size.

High latency spikes

Watch for thermal throttling — add active cooling. Also check for background updates or cron jobs that cause CPU contention.

Future-proofing: 2026 and beyond

Edge AI will continue to evolve quickly. As of 2026, expect:

  • Better micro‑quantization (4-bit and hybrid formats) that keeps accuracy while lowering memory.
  • Standardized NPU runtimes — vendor-specific drivers will give way to interoperable delegates in ONNX Runtime and WASM-based runtimes.
  • Federated and continual learning patterns for field devices to personalize models while preserving privacy.

Recap: Actionable checklist to go from zero to production

  1. Assemble hardware, confirm power and cooling.
  2. Install 64‑bit OS, vendor drivers, ONNX Runtime/NPU delegate.
  3. Run vendor benchmark + simple ONNX benchmark. Record metrics.
  4. Prototype one workload (voice, microservice, or classifier) using quantized models.
  5. Containerize the service, add healthchecks and logging, and test OTA updates.
  6. Deploy to a small fleet, collect observability data, and iterate model versions.

Closing thoughts and next steps

The AI HAT+ 2 on the Raspberry Pi 5 represents the practical bridge between prototype and production for many edge AI use cases in 2026. With a measured approach — baseline benchmarks, quantized models, containerized services, and solid observability — you can ship low-latency, privacy-preserving AI features without expensive cloud dependence.

Ready to get hands-on? Clone our sample repo (voice assistant, FastAPI microservice, and image inference) and run the vendor benchmark in under an hour. If you need a tailored checklist or deployment template for your fleet, reach out — we’ve helped IT teams standardize toolchains and cut edge inference costs by 40% on similar rollouts.

Call to action

Download the sample code, run the vendor benchmark, and share your numbers. Try one of the three projects above on a single Pi+HAT node and report back — we’ll help interpret the results and provide a production-ready deployment checklist tailored to your environment.


Related Topics

#Raspberry Pi, #Edge AI, #How-to

toolkit

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
