LLM on Raspberry Pi - cu-ecen-aeld/buildroot-assignments-base GitHub Wiki
This guide will help integrate llama.cpp (LLM inference engine) into your Buildroot-based embedded Linux system for Raspberry Pi.
- llama.cpp GitHub: https://github.com/ggerganov/llama.cpp
- Buildroot Package Example: https://github.com/cu-ecen-aeld/buildroot-assignments-base
- Model Downloads: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
- Working Project Repo: Buildroot repo, Application repo
- Raspberry Pi 4 (4GB+ RAM recommended)
Always test on full Linux before Buildroot integration!
```shell
# On Raspbian/Raspberry Pi OS
sudo apt update
sudo apt install -y build-essential git cmake

# Clone and build llama.cpp
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b4315
mkdir build && cd build
cmake .. -DGGML_CUDA=OFF
make -j4

# Binary location:
# ~/llama.cpp/build/bin/llama-cli
```
```shell
# Create models directory
mkdir -p ~/models

# Download TinyLlama model
cd ~/models
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Simple test
~/llama.cpp/build/bin/llama-cli \
    --model ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    --prompt "Hello, how are you?" \
    -n 20
```
If this doesn't work, fix it before proceeding to Buildroot!
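Beyond "it runs", it is worth confirming the generation speed during this sanity test. The helper below is a sketch, not part of llama.cpp: it assumes the timing summary contains a line ending in "tokens per second", and the exact wording can differ between llama.cpp versions.

```shell
# Hypothetical helper: extract generation speed from llama-cli's timing output.
# Assumes the eval-timing line contains "<N> tokens per second"; the exact
# wording may differ between llama.cpp versions.
tok_per_sec() {
    grep -o '[0-9][0-9.]* tokens per second' | awk 'END { print $1 }'
}

# Usage (llama.cpp prints its timings on stderr):
# ~/llama.cpp/build/bin/llama-cli \
#     --model ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
#     --prompt "Hello" -n 20 2>&1 | tok_per_sec
```

Compare the result against the model table later in this guide to spot throttling early.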
In your project repo, add Buildroot as a submodule (as done in assignment 5 and later).

```shell
cd ~/buildroot-assignments-base
mkdir -p base_external/package/llama-cpp
```

Config.in:
See: https://github.com/cu-ecen-aeld/final-project-hawa7555/blob/main/base_external/package/llama-cpp/Config.in

llama-cpp.mk:
See: https://github.com/cu-ecen-aeld/final-project-hawa7555/blob/main/base_external/package/llama-cpp/llama-cpp.mk
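The linked files are authoritative. As a rough sketch of what they contain (the symbol names, version tag, and options here are assumptions, not copies of the reference files), a Buildroot cmake-package for llama.cpp looks approximately like:

```
# base_external/package/llama-cpp/Config.in (sketch)
config BR2_PACKAGE_LLAMA_CPP
	bool "llama-cpp"
	help
	  llama.cpp LLM inference engine (llama-cli and friends).

# base_external/package/llama-cpp/llama-cpp.mk (sketch)
LLAMA_CPP_VERSION = b4315
LLAMA_CPP_SITE = $(call github,ggerganov,llama.cpp,$(LLAMA_CPP_VERSION))
LLAMA_CPP_LICENSE = MIT
LLAMA_CPP_LICENSE_FILES = LICENSE
# CPU-only build on the Pi; no GPU offload
LLAMA_CPP_CONF_OPTS = -DGGML_CUDA=OFF
$(eval $(cmake-package))
```

Remember that the new Config.in must also be sourced from your base_external Config.in so it shows up in menuconfig.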
The files above are the minimal set needed to integrate llama.cpp into Buildroot. For a more systematic structure suitable for application development, refer to this base_external folder: https://github.com/cu-ecen-aeld/final-project-hawa7555/tree/main/base_external
(Refer to the .mk and Config.in files in the base_external folder above, under the package folder you want to integrate. This document covers llama.cpp, but the steps are similar for whisper and piper, which provide speech-to-text and text-to-speech respectively.)
```shell
cd ~/buildroot
make menuconfig
```

Use menuconfig to enable the llama-cpp package in your target packages menu.
Include the model in the image:

```shell
# Create overlay directory
mkdir -p base_external/rootfs_overlay/root/models

# Download model (TinyLlama-1.1B is the smallest usable model)
wget -O base_external/rootfs_overlay/root/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# In menuconfig, set:
#   System configuration → Root filesystem overlay directories
# to your rootfs_overlay path
```

Alternative: create a separate package for model downloads. See example:
https://github.com/cu-ecen-aeld/final-project-hawa7555/blob/main/base_external/package/ai-models/ai-models.mk
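The linked ai-models.mk is the authoritative recipe. As a hedged sketch only, a generic-package that fetches a raw GGUF at build time and installs it into the image could look roughly like this (names and the version are assumptions; the EXTRACT override is needed because the download is not a tarball):

```
# base_external/package/ai-models/ai-models.mk (sketch; see the linked recipe)
AI_MODELS_VERSION = 1.0
AI_MODELS_SITE = https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main
AI_MODELS_SOURCE = tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# The source is a raw .gguf, not a tarball, so copy instead of extracting
define AI_MODELS_EXTRACT_CMDS
	cp $(AI_MODELS_DL_DIR)/$(AI_MODELS_SOURCE) $(@D)/
endef

define AI_MODELS_INSTALL_TARGET_CMDS
	$(INSTALL) -D -m 0644 $(@D)/$(AI_MODELS_SOURCE) \
		$(TARGET_DIR)/root/models/$(AI_MODELS_SOURCE)
endef

$(eval $(generic-package))
```

Compared to the rootfs overlay, this keeps the large model file out of your git tree and lets Buildroot cache the download.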
```shell
cd ~/buildroot
make clean
make -j$(nproc)
```

Build time: 1-3 hours on the first build.
Boot the Raspberry Pi and log in as root.

```shell
# Verify llama-cli is installed
which llama-cli
# Expected: /usr/bin/llama-cli

# Check help
llama-cli --help

# Run inference test
llama-cli \
    -m /root/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    -p "What is embedded Linux?" \
    -n 50
```
Expected behavior:
- Model loads (5-10 seconds)
- Text generation starts
- No crash/segmentation faults
```shell
file /usr/bin/llama-cli
# Must show: ARM aarch64 (not x86-64!)
```

Working examples from the reference implementation:
- Loading llama in background:
  https://github.com/hawa7555/final-project-assignment-hawa7555/blob/main/scripts/start_llama.sh
- Response parser:
  https://github.com/hawa7555/final-project-assignment-hawa7555/blob/main/app/llm_interface.c
- Testing LLM:
  https://github.com/hawa7555/final-project-assignment-hawa7555/blob/main/test/llm_test.c
- Model download recipe:
  https://github.com/cu-ecen-aeld/final-project-hawa7555/blob/main/base_external/package/ai-models/ai-models.mk
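The start_llama.sh script linked above keeps the model loaded in the background instead of paying the load time on every prompt. Below is a minimal sketch of that idea only; the function name, FIFO paths, and flags are assumptions, not the reference repo's actual interface.

```shell
# Sketch: keep an interactive llama-cli running in the background behind two
# FIFOs, so an application can write prompts and read replies as plain files.
# All names here are assumptions, not the reference scripts' interface.
start_llama() {
    bin="$1"     # e.g. /usr/bin/llama-cli
    model="$2"   # e.g. /root/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
    in="$3"      # FIFO the application writes prompts to
    out="$4"     # FIFO the application reads responses from
    [ -p "$in" ]  || mkfifo "$in"
    [ -p "$out" ] || mkfifo "$out"
    "$bin" -m "$model" --interactive-first < "$in" > "$out" 2>/dev/null &
    echo $!      # caller stores the PID, e.g. in /tmp/llm.pid
}
```

A C application (like the llm_interface.c linked above) can then open the FIFOs like ordinary files to talk to the model.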
Tested models on Raspberry Pi 4 (4GB):
| Model | Size | Load Time | Speed | Quality | Recommended |
|---|---|---|---|---|---|
| TinyLlama-Q2_K | 450MB | 3-4s | 8-10 tok/s | Good | Fast inference |
| TinyLlama-Q4_K_M | 637MB | 5-7s | 6-8 tok/s | Better | Best balance |
| TinyLlama-Q8_0 | 1.1GB | 8-10s | 4-6 tok/s | Best | For 8GB Pi only |
Selection criteria:
- Smaller model size - fits in limited RAM
- Q4_K_M quantization - best speed/quality trade-off (balanced)
- Context length 512-1024 - sufficient for conversational AI
- TinyLlama architecture - optimized for low-resource inference
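The size criterion can be mechanized at runtime. The helper below is hypothetical and the memory thresholds are rough assumptions derived from the table above, not measured limits:

```shell
# Hypothetical helper: pick a TinyLlama quantization from available memory (kB).
# Thresholds are rough assumptions based on the model table, not measured limits.
pick_quant() {
    kb="$1"
    if [ "$kb" -ge 2000000 ]; then
        echo "Q8_0"      # ~1.1 GB model: needs lots of headroom (8GB Pi)
    elif [ "$kb" -ge 1000000 ]; then
        echo "Q4_K_M"    # ~637 MB: best balance
    else
        echo "Q2_K"      # ~450 MB: fastest load, lowest quality
    fi
}

# Usage on the Pi:
# pick_quant "$(awk '/MemAvailable/ { print $2 }' /proc/meminfo)"
```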
| Issue | Cause | Solution |
|---|---|---|
| "cannot execute binary file" | Binary is x86_64 | Verify cross-compilation |
| Very slow (<3 tok/s) | CPU throttling | Check governor: `scaling_governor` should be `performance` |
| Model not found | Wrong path | Verify /root/models/*.gguf exists |
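For the throttling row above, the governor can be checked and forced via sysfs. A sketch: the base path is parameterized only so it can be exercised off-target; on the Pi, call it with no argument as root.

```shell
# Sketch: force every core's cpufreq governor to "performance" to avoid
# throttled inference. Pass an alternate sysfs base only for off-target testing.
set_governor() {
    base="${1:-/sys/devices/system/cpu}"
    for gov in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$gov" ] || continue
        echo performance > "$gov"
    done
}

# On the Pi (as root):
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # check current
# set_governor                                                # force performance
```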
- Prompt templates - pre-defined templates for common queries
- GPU support - future Pi models with better GPU support
- Testing on an 8GB Pi for better quality and speed
- Fine-tuning the model for a specific application
Note: This guide describes a working configuration as of December 2025. Future versions of llama.cpp or Buildroot may require adjustments.