gemma3 27b 100k context - mostlygeek/llama-swap GitHub Wiki

Warning

There is currently a bug with gemma3 at high context that produces `<unused32>` errors. See this issue.

These configurations for gemma3 27b fit 100K context on:

  • a single 24GB GPU with q4_0 KV quantization
  • dual 24GB GPUs with q8_0 KV quantization (running without KV quantization currently appears to hit a bug)
  • examples tested on a single 3090, a single P40, dual 3090s, and dual P40s
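The single-GPU vs dual-GPU split above follows from how densely each KV cache type stores values. A rough scaling sketch from the GGUF block layouts (q4_0 packs 32 values into 18 bytes, q8_0 into 34 bytes, f16 uses 2 bytes per value) — note the real footprint also depends on gemma3's layer count and sliding-window attention, so treat this only as a relative guide:

```python
# Bytes per cached value for each KV cache type, from GGUF block layouts:
# q4_0 packs 32 values into 18 bytes, q8_0 into 34 bytes, f16 is 2 bytes/value.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

for ctype, b in BYTES_PER_VALUE.items():
    print(f"{ctype}: {b:.4f} B/value ({b / 2.0:.0%} of f16)")
# q8_0 is ~53% of the f16 cache size, q4_0 is ~28%
```

So at the same 100K context, switching the cache from q8_0 down to q4_0 roughly halves the KV memory again, which is what lets the single-24GB-GPU config fit.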

Setup

  • Download both the model and the mmproj files from Google's huggingface repo: google/gemma-3-27b-it-qat-q4_0
  • Use the latest llama-server; version b5568 was used for these configs.

Config

macros:
  "server-latest":
    /path/to/llama-server/llama-server-latest
    --host 127.0.0.1 --port ${PORT}
    --flash-attn -ngl 999 -ngld 999
    --no-mmap

  "gemma3-args": |
      --model /path/to/models/gemma-3-27b-it-q4_0.gguf
      --temp 1.0
      --repeat-penalty 1.0
      --min-p 0.01
      --top-k 64
      --top-p 0.95

models:
  # fits on a single 24GB GPU w/ 100K context
  # requires Q4 KV quantization, ~22GB VRAM
  "gemma-single":
    cmd: |
      ${server-latest}
      ${gemma3-args}
      --cache-type-k q4_0 
      --cache-type-v q4_0
      --ctx-size 102400
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf

  # requires ~30GB VRAM
  "gemma":
    cmd: |
      ${server-latest}
      ${gemma3-args}
      --cache-type-k q8_0 
      --cache-type-v q8_0
      --ctx-size 102400
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf

  # draft model settings
  # --mmproj not compatible with draft models
  # ~32.5 GB VRAM @ 82K context 
  "gemma-draft":
    env:
      # 3090 - 38 tok/sec
      - "CUDA_VISIBLE_DEVICES=GPU-6f0,GPU-f10"
    cmd: |
      ${server-latest}
      ${gemma3-args}
      --cache-type-k q8_0 
      --cache-type-v q8_0
      --ctx-size 102400
      --model-draft /path/to/models/gemma-3-4b-it-q4_0.gguf
      --ctx-size-draft 102400
      --draft-max 8 --draft-min 4
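llama-swap exposes an OpenAI-compatible endpoint and launches whichever entry under `models:` matches the request's `model` field. A minimal sketch of selecting the single-GPU config — the listen address `localhost:8080` is an assumption; use whatever your llama-swap instance is configured with:

```python
import json
import urllib.request

# The "model" field selects the llama-swap config entry ("gemma-single",
# "gemma", or "gemma-draft"); llama-swap starts/stops llama-server as needed.
payload = {
    "model": "gemma-single",
    "messages": [{"role": "user", "content": "Write a haiku about VRAM."}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed llama-swap address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once llama-swap is running with the config above:
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```

The first request to a given model name triggers the swap, so expect a pause while llama-server loads the 27b weights before the first token arrives.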

Speculative Decoding Stats

Uses this testing script.

no draft model

| prompt | n | tok/sec | draft_n | draft_accepted | ratio |
| --- | --- | --- | --- | --- | --- |
| create a one page html snake game in javascript | 1600 | 38.71 | null | null | N/A |
| write a snake game in python | 1929 | 38.49 | null | null | N/A |
| write a story about a dog | 859 | 38.88 | null | null | N/A |

4b draft model

| prompt | n | tok/sec | draft_n | draft_accepted | ratio | Δ % |
| --- | --- | --- | --- | --- | --- | --- |
| create a one page html snake game in javascript | 1542 | 49.07 | 1422 | 956 | 0.67 | 26.7% |
| write a snake game in python | 1904 | 50.67 | 1709 | 1236 | 0.72 | 31.6% |
| write a story about a dog | 982 | 33.97 | 1068 | 282 | 0.26 | -14.4% |
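The `ratio` column is `draft_accepted / draft_n`, the fraction of drafted 4b tokens the 27b model accepted. Recomputing it from the table:

```python
# (draft_n, draft_accepted) per prompt, taken from the 4b draft model table.
rows = {
    "html snake game": (1422, 956),
    "python snake game": (1709, 1236),
    "story about a dog": (1068, 282),
}

for name, (draft_n, accepted) in rows.items():
    print(f"{name}: acceptance ratio {accepted / draft_n:.2f}")
```

The low 0.26 acceptance on the creative-writing prompt lines up with its tok/sec regression: rejected drafts waste verification work, which is why speculative decoding helps most on predictable, code-like output.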

Tip

Try path.to.sh.
