LLM on Raspberry Pi - cu-ecen-aeld/buildroot-assignments-base GitHub Wiki
This guide will help integrate llama.cpp (LLM inference engine) into your Buildroot-based embedded Linux system for Raspberry Pi.
- llama.cpp GitHub: https://github.com/ggerganov/llama.cpp
- Buildroot Package Example: https://github.com/cu-ecen-aeld/buildroot-assignments-base
- Model Downloads: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
- Working Project Repo: Buildroot repo, Application repo
- Raspberry Pi 4 (4GB+ RAM recommended)
Always test on full Linux before Buildroot integration!
```shell
# On Raspbian/Raspberry Pi OS
sudo apt update
sudo apt install -y build-essential git cmake

# Clone and build llama.cpp
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b4315
mkdir build && cd build
cmake .. -DGGML_CUDA=OFF
make -j4

# Binary location:
# ~/llama.cpp/build/bin/llama-cli
```
```shell
# Create models directory
mkdir -p ~/models

# Download TinyLlama model
cd ~/models
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Simple test
~/llama.cpp/build/bin/llama-cli \
    --model ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    --prompt "Hello, how are you?" \
    -n 20
```
If this doesn't work, fix it before proceeding to Buildroot!
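Beyond "it runs", it is worth confirming the generation speed during this sanity test. The helper below is a sketch, not part of llama.cpp: it assumes the timing summary contains a line ending in "tokens per second", and the exact wording can differ between llama.cpp versions.

```shell
# Hypothetical helper: extract generation speed from llama-cli's timing output.
# Assumes the eval-timing line contains "<N> tokens per second"; the exact
# wording may differ between llama.cpp versions.
tok_per_sec() {
    grep -o '[0-9][0-9.]* tokens per second' | awk 'END { print $1 }'
}

# Usage (llama.cpp prints its timings on stderr):
# ~/llama.cpp/build/bin/llama-cli \
#     --model ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
#     --prompt "Hello" -n 20 2>&1 | tok_per_sec
```

Compare the result against the model table later in this guide to spot throttling early.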
In your project repo, add Buildroot as a submodule (as done in assignment 5 and later).

```shell
cd ~/buildroot-assignments-base
mkdir -p base_external/package/llama-cpp
```

Config.in:
See: https://github.com/cu-ecen-aeld/final-project-hawa7555/blob/main/base_external/package/llama-cpp/Config.in

llama-cpp.mk:
See: https://github.com/cu-ecen-aeld/final-project-hawa7555/blob/main/base_external/package/llama-cpp/llama-cpp.mk
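The linked files are authoritative. As a rough sketch of what they contain (the symbol names, version tag, and options here are assumptions, not copies of the reference files), a Buildroot cmake-package for llama.cpp looks approximately like:

```
# base_external/package/llama-cpp/Config.in (sketch)
config BR2_PACKAGE_LLAMA_CPP
	bool "llama-cpp"
	help
	  llama.cpp LLM inference engine (llama-cli and friends).

# base_external/package/llama-cpp/llama-cpp.mk (sketch)
LLAMA_CPP_VERSION = b4315
LLAMA_CPP_SITE = $(call github,ggerganov,llama.cpp,$(LLAMA_CPP_VERSION))
LLAMA_CPP_LICENSE = MIT
LLAMA_CPP_LICENSE_FILES = LICENSE
# CPU-only build on the Pi; no GPU offload
LLAMA_CPP_CONF_OPTS = -DGGML_CUDA=OFF
$(eval $(cmake-package))
```

Remember that the new Config.in must also be sourced from your base_external Config.in so it shows up in menuconfig.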
The files above are the minimal set needed to integrate llama.cpp into Buildroot. For a more systematic structure suitable for application development, refer to this base_external folder: https://github.com/cu-ecen-aeld/final-project-hawa7555/tree/main/base_external
(Refer to the .mk and Config.in files in the base_external folder above, under the package folder you want to integrate. This document covers llama.cpp, but the steps are similar for whisper and piper, which provide speech-to-text and text-to-speech respectively.)
```shell
cd ~/buildroot
make menuconfig
```

Use menuconfig to enable the llama-cpp package in your target packages menu.
Include the model in the image:

```shell
# Create overlay directory
mkdir -p base_external/rootfs_overlay/root/models

# Download model (TinyLlama-1.1B is the smallest usable model)
wget -O base_external/rootfs_overlay/root/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# In menuconfig, set:
#   System configuration → Root filesystem overlay directories
# to your rootfs_overlay path
```

Alternative: create a separate package for model downloads. See example:
https://github.com/cu-ecen-aeld/final-project-hawa7555/blob/main/base_external/package/ai-models/ai-models.mk
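The linked ai-models.mk is the authoritative recipe. As a hedged sketch only, a generic-package that fetches a raw GGUF at build time and installs it into the image could look roughly like this (names and the version are assumptions; the EXTRACT override is needed because the download is not a tarball):

```
# base_external/package/ai-models/ai-models.mk (sketch; see the linked recipe)
AI_MODELS_VERSION = 1.0
AI_MODELS_SITE = https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main
AI_MODELS_SOURCE = tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# The source is a raw .gguf, not a tarball, so copy instead of extracting
define AI_MODELS_EXTRACT_CMDS
	cp $(AI_MODELS_DL_DIR)/$(AI_MODELS_SOURCE) $(@D)/
endef

define AI_MODELS_INSTALL_TARGET_CMDS
	$(INSTALL) -D -m 0644 $(@D)/$(AI_MODELS_SOURCE) \
		$(TARGET_DIR)/root/models/$(AI_MODELS_SOURCE)
endef

$(eval $(generic-package))
```

Compared to the rootfs overlay, this keeps the large model file out of your git tree and lets Buildroot cache the download.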
```shell
cd ~/buildroot
make clean
make -j$(nproc)
```

Build time: 1-3 hours on the first build.
Boot the Raspberry Pi and log in as root.

```shell
# Verify llama-cli is installed
which llama-cli
# Expected: /usr/bin/llama-cli

# Check help
llama-cli --help

# Run inference test
llama-cli \
    -m /root/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    -p "What is embedded Linux?" \
    -n 50
```
Expected behavior:
- Model loads (5-10 seconds)
- Text generation starts
- No crash/segmentation faults
```shell
file /usr/bin/llama-cli
# Must show: ARM aarch64 (not x86-64!)
```

Working examples from the reference implementation:
- Loading llama in background:
  https://github.com/hawa7555/final-project-assignment-hawa7555/blob/main/scripts/start_llama.sh
- Response parser:
  https://github.com/hawa7555/final-project-assignment-hawa7555/blob/main/app/llm_interface.c
- Testing LLM:
  https://github.com/hawa7555/final-project-assignment-hawa7555/blob/main/test/llm_test.c
- Model download recipe:
  https://github.com/cu-ecen-aeld/final-project-hawa7555/blob/main/base_external/package/ai-models/ai-models.mk
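The start_llama.sh script linked above keeps the model loaded in the background instead of paying the load time on every prompt. Below is a minimal sketch of that idea only; the function name, FIFO paths, and flags are assumptions, not the reference repo's actual interface.

```shell
# Sketch: keep an interactive llama-cli running in the background behind two
# FIFOs, so an application can write prompts and read replies as plain files.
# All names here are assumptions, not the reference scripts' interface.
start_llama() {
    bin="$1"     # e.g. /usr/bin/llama-cli
    model="$2"   # e.g. /root/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
    in="$3"      # FIFO the application writes prompts to
    out="$4"     # FIFO the application reads responses from
    [ -p "$in" ]  || mkfifo "$in"
    [ -p "$out" ] || mkfifo "$out"
    "$bin" -m "$model" --interactive-first < "$in" > "$out" 2>/dev/null &
    echo $!      # caller stores the PID, e.g. in /tmp/llm.pid
}
```

A C application (like the llm_interface.c linked above) can then open the FIFOs like ordinary files to talk to the model.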
Tested models on Raspberry Pi 4 (4GB):
| Model | Size | Load Time | Speed | Quality | Recommended |
|---|---|---|---|---|---|
| TinyLlama-Q2_K | 450MB | 3-4s | 8-10 tok/s | Good | Fast inference |
| TinyLlama-Q4_K_M | 637MB | 5-7s | 6-8 tok/s | Better | Best balance |
| TinyLlama-Q8_0 | 1.1GB | 8-10s | 4-6 tok/s | Best | For 8GB Pi only |
Selection criteria:
- Smaller model size - fits in limited RAM
- Q4_K_M quantization - best speed/quality trade-off (balanced)
- Context length 512-1024 - sufficient for conversational AI
- TinyLlama architecture - optimized for low-resource inference
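The size criterion can be mechanized at runtime. The helper below is hypothetical and the memory thresholds are rough assumptions derived from the table above, not measured limits:

```shell
# Hypothetical helper: pick a TinyLlama quantization from available memory (kB).
# Thresholds are rough assumptions based on the model table, not measured limits.
pick_quant() {
    kb="$1"
    if [ "$kb" -ge 2000000 ]; then
        echo "Q8_0"      # ~1.1 GB model: needs lots of headroom (8GB Pi)
    elif [ "$kb" -ge 1000000 ]; then
        echo "Q4_K_M"    # ~637 MB: best balance
    else
        echo "Q2_K"      # ~450 MB: fastest load, lowest quality
    fi
}

# Usage on the Pi:
# pick_quant "$(awk '/MemAvailable/ { print $2 }' /proc/meminfo)"
```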
| Issue | Cause | Solution |
|---|---|---|
| "cannot execute binary file" | Binary is x86_64 | Verify cross-compilation |
| Very slow (<3 tok/s) | CPU throttling | Check governor: `scaling_governor` should be `performance` |
| Model not found | Wrong path | Verify /root/models/*.gguf exists |
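For the throttling row above, the governor can be checked and forced via sysfs. A sketch: the base path is parameterized only so it can be exercised off-target; on the Pi, call it with no argument as root.

```shell
# Sketch: force every core's cpufreq governor to "performance" to avoid
# throttled inference. Pass an alternate sysfs base only for off-target testing.
set_governor() {
    base="${1:-/sys/devices/system/cpu}"
    for gov in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$gov" ] || continue
        echo performance > "$gov"
    done
}

# On the Pi (as root):
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # check current
# set_governor                                                # force performance
```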
- Prompt templates - pre-defined templates for common queries
- GPU support - future Pi models with better GPU support
- Testing on an 8GB Pi for better quality and speed
- Fine-tuning the model for a specific application
Note: This guide describes a working configuration as of December 2025. Future versions of llama.cpp or Buildroot may require adjustments.