# AI LLM
"AI" is a poor name for these things, because they are not artificial and they are not intelligent.
## Software
- llama.cpp - MIT license, offline inference server with web interface and OpenAI-compatible API; has support for Vulkan, CUDA, and ROCm.
  My preference when coupled with llama-swap (see the example after this list).
  All of the others below use llama.cpp as their model runtime.
- LocalAI - API server with web interface and model gallery; the All-In-One images can also do images/sound/video; complex but powerful
- LM Studio - Proprietary, non-commercial, Linux AppImage, local GUI with model browser and API server
- GPT4ALL - MIT license, Linux Flatpak, local GUI with model browser and API server
- Msty - Proprietary, non-commercial, Linux AppImage, local GUI with model browser and prompt library
- Ollama - API server with web interface (I think? didn't really like its sideload method so I don't use it)
- Llamafile - creates a single binary of llama.cpp and a model which runs anywhere (Lin/Mac/Win)
- Kobold.cpp, SillyTavern, Oobabooga - used by storytelling / roleplay folks, UI specialised for loading "character" cards
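Since llama.cpp and most of the tools above expose an OpenAI-compatible API, any OpenAI client library can talk to them. A minimal sketch, assuming `llama-server` is listening on its default port 8080 and the `openai` Python package is installed; the model name and prompt here are placeholders:

```python
# Minimal sketch: chat with a local llama-server via its OpenAI-compatible API.
# Assumes llama-server is already running on the default port 8080, e.g.:
#   llama-server -m ./Llama-3.2-3B-Instruct-Q8_0.gguf
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local server, not api.openai.com
    api_key="not-needed",                 # llama-server ignores the key by default
)

response = client.chat.completions.create(
    model="local",  # a bare llama-server answers with whatever model it loaded
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(response.choices[0].message.content)
```

With llama-swap in front, the `model` field is what selects which model config gets loaded; against a bare `llama-server` it is ignored.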
## Models
### General
- Use the "Instruct" or "Chat" version of a model to get an experience like ChatGPT/Claude
- If a general-release model is over 9 months old it's probably not worth using; only a very lucky specialist finetune might last that long.
- If a model is over 12 months old it's almost certainly been superseded.
- "Quantization" changes a model from 32-bit or 16-bit floating point (slow, high memory requirement, accurate) to 8-bit or smaller integers (faster, smaller memory requirement, less accurate).
- As a general rule, at Q8: billions of parameters = filesize in GB = RAM requirement, so an 8B model is an ~8 GB file and needs ~8 GB of RAM (see the sketch after this list).
- In 2023, models were less "dense" and it was always better to use a larger model at a smaller quant, even using Q2.
- In 2025, models are more dense and can suffer quality loss sooner. Some models become useless even at Q6, while some are still good at Q4.
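That rule of thumb is just parameters multiplied by bytes per weight. A minimal sketch of the arithmetic (my own illustration, not output from any tool above); real GGUF files vary a little with the quant scheme, and the KV cache needs extra memory on top:

```python
# Rough size/RAM estimate for a quantized model: parameters x bytes per weight.
# Illustrative only - real GGUF quants (Q4_K_M etc.) use slightly more bits per
# weight than their name suggests, and the KV cache for the context window
# needs extra memory on top of this.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # billions of params x bytes/weight = GB

for name, bits in [("F16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"8B model at {name}: ~{model_size_gb(8, bits):.0f} GB")
# 8B model at F16: ~16 GB
# 8B model at Q8: ~8 GB
# 8B model at Q4: ~4 GB
```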
### Very Small - under 8B, CPU inference is possible and useful
- Meta - Llama 3.2 3B
- Google - Gemma 2 2B
- Microsoft - Phi 3.5 mini
- nVidia - Nemotron Mini 4B
- HuggingFace - SmolLM2 1.7B
- IBM - Granite 3.1 2B
- LG - Exaone 3.5 2.4B
- TII - Falcon 3 3B
### Small - under 14B, slow CPU inference or a single gaming GPU
- Meta - Llama 3.1 8B
- Google - Gemma 2 9B
- IBM - Granite 3.1 8B
- Mistral AI - Mistral 7B v0.3, Ministral 8B, Mistral Nemo 12B
- nVidia - Mistral NeMo Minitron 8B
- LG - Exaone 3.5 7.8B
- TII - Falcon 3 8B
- 01-ai - Yi 9B
- Many Llama 3.x mods: Hermes, FuseChat, Tulu, OLMo, and others
### Code Models
- Qwen 2.5 Coder 7B / 14B / 32B, CodeQwen 1.5 7B
- Yi Coder 9B
- DeepSeek Coder 6.7B and V2 Lite 15.7B
- Do not bother with small code models like 3B
## Hardware
If you have an AMD GPU, learn to use the Debian/Ubuntu ROCm libraries, which support far more GPUs than the official AMD releases do. See:
If you have an nVidia GPU, follow your distro's instructions to set up drivers and CUDA. It just works.
If you have an Intel Arc GPU, use Vulkan inference (slow, but faster than CPU) or sell it and buy an nVidia card. The Intel software ecosystem is just not there: OneAPI is confusing/impossible, and IPEX-LLM only provides an outdated fork of llama.cpp. Such a shame, because the Arc A770 is by far the cheapest way to get 16 GB of VRAM and 17 TFLOPS. Come on Intel, get your act together!
If you have Intel UHD graphics, use CPU inference; the GPU will be slower than the CPU.
## Inference Performance
- llama.cpp performance on Apple Silicon
- https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
- https://www.reddit.com/r/StableDiffusion/comments/13hyn0c/selfreported_gpus_and_iterationssecond_based_on/
Inference performance scales almost exactly with RAM bandwidth.
Using an AMD Ryzen 5 5600G, inference speed is no different between CPU and ROCm because system RAM is the bottleneck. Faster RAM helps more than anything else: 2166 MHz RAM gives ~5 tok/sec, 3200 MHz RAM gives ~15 tok/sec (Llama 3.2 3B).
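Back-of-envelope: for every generated token the full set of weights has to be streamed through the memory bus once, so tokens/sec is roughly memory bandwidth divided by model size. A minimal sketch of that estimate; the bandwidth figures are approximate and the model size (~3.5 GB, about a 3B model at Q8) is just an example:

```python
# Upper bound on generation speed: each token streams all weights from memory
# once, so tok/sec <= memory bandwidth (GB/s) / model size (GB).
# Real speeds land below this due to compute overhead and KV-cache traffic.
def tok_per_sec_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 3.5  # roughly a 3B model at Q8
for name, bw in [("dual-channel DDR4-3200, ~51 GB/s", 51),
                 ("dual-channel DDR5-5600, ~90 GB/s", 90),
                 ("RTX 3060 12GB, ~360 GB/s", 360)]:
    print(f"{name}: ceiling ~{tok_per_sec_ceiling(bw, MODEL_GB):.0f} tok/sec")
```

This is why faster RAM helps more than a faster CPU for local inference.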
## Transformer Architecture
- https://medium.com/@plienhar/llm-inference-series-1-introduction-9c78e56ef49d
- https://medium.com/@plienhar/llm-inference-series-2-the-two-phase-process-behind-llms-responses-1ff1ff021cd5
- https://medium.com/@plienhar/llm-inference-series-3-kv-caching-unveiled-048152e461c8
- https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8
- https://medium.com/@plienhar/llm-inference-series-5-dissecting-model-performance-6144aa93168f
- 3blue1brown - LLMs explained briefly
- 3blue1brown - Neural Networks playlist