# Llama 4 Scout on triple 24GB GPUs
This guide uses Unsloth's Q4_K_XL quant of Llama 4 Scout from HuggingFace and was adapted from their excellent how-to-run guide.
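If you don't have the model files yet, one way to fetch just the split GGUF files is with `huggingface-cli`. This is a sketch: the `unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF` repo id and the `--include` pattern are assumptions based on Unsloth's usual naming, so check their HuggingFace page.

```sh
# Download only the UD-Q4_K_XL split files into a local models directory.
# Repo id and file pattern are assumptions; verify against Unsloth's repo.
huggingface-cli download unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "*UD-Q4_K_XL*" \
  --local-dir /path/to/models
```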
This configuration loads everything into VRAM, so you will need at least THREE 24GB video cards. The context size is set to 62,000 tokens, which uses 68,619MB of VRAM with the model and context fully loaded. You may be able to push it a bit further, but this configuration runs quite stably.
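To confirm the model and KV cache actually fit on your cards, you can watch per-GPU memory while the server loads. A minimal check, assuming NVIDIA GPUs with `nvidia-smi` available:

```sh
# Print used/total VRAM per GPU once a second while llama-server loads.
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```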
## Config
```yaml
models:
  "llama4-scout":
    cmd: |
      /path/to/llama-server/llama-server-latest --slots --host 127.0.0.1 --port ${PORT}
      --flash-attn -ngl 999 -ngld 999 --no-mmap
      --cache-type-k q8_0 --cache-type-v q8_0
      --ctx-size 62000
      --model /path/to/models/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf
      --samplers "top_k;top_p;min_p;dry;temperature;typ_p;xtc"
      --dry-multiplier 0.8
      --temp 0.6
      --min-p 0.01
      --top-p 0.9
      --swa-full
```
> [!TIP]
> Swap in your real paths for the `/path/to/...` placeholders (e.g. `/path/to/models/...`) in the example.
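Once llama-swap is running with this config, requesting the model by name triggers the swap and loads it with the settings above. A hedged example, assuming llama-swap is listening on its default port 8080:

```sh
# Hit llama-swap's OpenAI-compatible endpoint; the "model" field must
# match the key in the config ("llama4-scout") for the swap to trigger.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4-scout",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```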