# AI LLM
"AI" is a poor name for these things, because they are not artificial and they are not intelligent.
## Software
- llama.cpp - MIT license, offline inference server with web interface and OpenAI-compatible API; has support for Vulkan, CUDA, and ROCm.
  My preference when coupled with llama-swap (see the example after this list).
  All of the others below use llama.cpp as their model runtime.
- LocalAI - API server with web interface and model gallery; the All-In-One images can also do images/sound/video; complex but powerful
- LM Studio - Proprietary, non-commercial, Linux AppImage, local GUI with model browser and API server
- GPT4ALL - MIT license, Linux Flatpak, local GUI with model browser and API server
- Msty - Proprietary, non-commercial, Linux AppImage, local GUI with model browser and prompt library
- Ollama - API server with web interface (I think? didn't really like its sideload method so I don't use it)
- Llamafile - creates a single binary of llama.cpp and a model which runs anywhere (Lin/Mac/Win)
- Kobold.cpp, SillyTavern, Oobabooga - used by storytelling / roleplay folks, UI specialised for loading "character" cards
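Since llama.cpp and most of the tools above expose an OpenAI-compatible API, any OpenAI client library can talk to them. A minimal sketch, assuming `llama-server` is listening on its default port 8080 and the `openai` Python package is installed; the model name and prompt here are placeholders:

```python
# Minimal sketch: chat with a local llama-server via its OpenAI-compatible API.
# Assumes llama-server is already running on the default port 8080, e.g.:
#   llama-server -m ./Llama-3.2-3B-Instruct-Q8_0.gguf
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local server, not api.openai.com
    api_key="not-needed",                 # llama-server ignores the key by default
)

response = client.chat.completions.create(
    model="local",  # a bare llama-server answers with whatever model it loaded
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(response.choices[0].message.content)
```

With llama-swap in front, the `model` field is what selects which model config gets loaded; against a bare `llama-server` it is ignored.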
## Models
### General
- Use the "Instruct" or "Chat" version of a model to get an experience like ChatGPT/Claude
- If a general-release model is over 9 months old it's probably not worth using; only a very lucky specialist finetune might last that long.
- If a model is over 12 months old it's almost certainly been superseded.
- "Quantization" changes a model from 32-bit or 16-bit floating point (slow, high memory requirement, accurate) to 8-bit or smaller integers (faster, smaller memory requirement, less accurate).
- As a general rule, at Q8: billions of parameters = filesize in GB = RAM requirement, so an 8B model is an ~8 GB file and needs ~8 GB of RAM (see the sketch after this list).
- In 2023, models were less "dense" and it was always better to use a larger model at a smaller quant, even using Q2.
- In 2025, models are more dense and can suffer quality loss sooner. Some models become useless even at Q6, while some are still good at Q4.
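That rule of thumb is just parameters multiplied by bytes per weight. A minimal sketch of the arithmetic (my own illustration, not output from any tool above); real GGUF files vary a little with the quant scheme, and the KV cache needs extra memory on top:

```python
# Rough size/RAM estimate for a quantized model: parameters x bytes per weight.
# Illustrative only - real GGUF quants (Q4_K_M etc.) use slightly more bits per
# weight than their name suggests, and the KV cache for the context window
# needs extra memory on top of this.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # billions of params x bytes/weight = GB

for name, bits in [("F16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"8B model at {name}: ~{model_size_gb(8, bits):.0f} GB")
# 8B model at F16: ~16 GB
# 8B model at Q8: ~8 GB
# 8B model at Q4: ~4 GB
```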
### Very Small - under 8B, CPU inference is possible and useful
- Meta - Llama 3.2 3B
- Google - Gemma 2 2B
- Microsoft - Phi 3.5 mini
- nVidia - Nemotron Mini 4B
- HuggingFace - SmolLM2 1.7B
- IBM - Granite 3.1 2B
- LG - Exaone 3.5 2.4B
- TII - Falcon 3 3B
### Small - under 14B, slow CPU inference or a single gaming GPU
- Meta - Llama 3.1 8B
- Google - Gemma 2 9B
- IBM - Granite 3.1 8B
- Mistral AI - Mistral 7B v0.3, Ministral 8B, Mistral Nemo 12B
- nVidia - Mistral NeMo Minitron 8B
- LG - Exaone 3.5 7.8B
- TII - Falcon 3 8B
- 01-ai - Yi 9B
- Many Llama 3.x mods: Hermes, FuseChat, Tulu, OLMo, and others
### Code Models
- Qwen 2.5 Coder 7B / 14B / 32B, CodeQwen 1.5 7B
- Yi Coder 9B
- DeepSeek Coder 6.7B and V2 Lite 15.7B
- Do not bother with small code models like 3B
## Hardware
If you have an AMD GPU, learn to use the Debian/Ubuntu ROCm libraries, which support far more GPUs than the official AMD releases do. See:
If you have an nVidia GPU, follow your distro's instructions to set up drivers and CUDA. It just works.
If you have an Intel Arc GPU, use Vulkan inference (slow, but faster than CPU) or sell it and buy an nVidia card. The Intel software ecosystem is just not there: OneAPI is confusing/impossible, and IPEX-LLM only provides an outdated fork of llama.cpp. Such a shame, because the Arc A770 is by far the cheapest way to get 16 GB of VRAM and 17 TFLOPS. Come on Intel, get your act together!
If you have Intel UHD graphics, use CPU inference; the GPU will be slower than the CPU.
## Inference Performance
- llama.cpp performance on Apple Silicon
- https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
- https://www.reddit.com/r/StableDiffusion/comments/13hyn0c/selfreported_gpus_and_iterationssecond_based_on/
Inference performance scales almost exactly with RAM bandwidth.
Using an AMD Ryzen 5 5600G, inference speed is no different between CPU and ROCm because system RAM is the bottleneck. Faster RAM helps more than anything else: 2166 MHz RAM gives ~5 tok/sec, 3200 MHz RAM gives ~15 tok/sec (Llama 3.2 3B).
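Back-of-envelope: for every generated token the full set of weights has to be streamed through the memory bus once, so tokens/sec is roughly memory bandwidth divided by model size. A minimal sketch of that estimate; the bandwidth figures are approximate and the model size (~3.5 GB, about a 3B model at Q8) is just an example:

```python
# Upper bound on generation speed: each token streams all weights from memory
# once, so tok/sec <= memory bandwidth (GB/s) / model size (GB).
# Real speeds land below this due to compute overhead and KV-cache traffic.
def tok_per_sec_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 3.5  # roughly a 3B model at Q8
for name, bw in [("dual-channel DDR4-3200, ~51 GB/s", 51),
                 ("dual-channel DDR5-5600, ~90 GB/s", 90),
                 ("RTX 3060 12GB, ~360 GB/s", 360)]:
    print(f"{name}: ceiling ~{tok_per_sec_ceiling(bw, MODEL_GB):.0f} tok/sec")
```

This is why faster RAM helps more than a faster CPU for local inference.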
## Transformer Architecture
- https://medium.com/@plienhar/llm-inference-series-1-introduction-9c78e56ef49d
- https://medium.com/@plienhar/llm-inference-series-2-the-two-phase-process-behind-llms-responses-1ff1ff021cd5
- https://medium.com/@plienhar/llm-inference-series-3-kv-caching-unveiled-048152e461c8
- https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8
- https://medium.com/@plienhar/llm-inference-series-5-dissecting-model-performance-6144aa93168f
- 3blue1brown - LLMs explained briefly
- 3blue1brown - Neural Networks playlist