gemma3 27b 100k context - mostlygeek/llama-swap GitHub Wiki

Warning

There is currently a bug with gemma3 at high context that produces `<unused32>` errors. See this issue.

These configurations for gemma3 27b fit 100K context on:

  • a single 24GB GPU with q4_0 KV quantization
  • dual 24GB GPUs with q8_0 KV quantization (running without KV quantization currently appears to hit a bug)
  • examples tested on a single 3090, a single P40, dual 3090s, and dual P40s
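The single-GPU vs dual-GPU split above follows from how densely each KV cache type stores values. A rough scaling sketch from the GGUF block layouts (q4_0 packs 32 values into 18 bytes, q8_0 into 34 bytes, f16 uses 2 bytes per value) — note the real footprint also depends on gemma3's layer count and sliding-window attention, so treat this only as a relative guide:

```python
# Bytes per cached value for each KV cache type, from GGUF block layouts:
# q4_0 packs 32 values into 18 bytes, q8_0 into 34 bytes, f16 is 2 bytes/value.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

for ctype, b in BYTES_PER_VALUE.items():
    print(f"{ctype}: {b:.4f} B/value ({b / 2.0:.0%} of f16)")
# q8_0 is ~53% of the f16 cache size, q4_0 is ~28%
```

So at the same 100K context, switching the cache from q8_0 down to q4_0 roughly halves the KV memory again, which is what lets the single-24GB-GPU config fit.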

Setup

  • Download both the model and the mmproj files from Google's huggingface repo: google/gemma-3-27b-it-qat-q4_0
  • Use the latest llama-server; version b5568 was used for these configs.

Config

macros:
  "server-latest":
    /path/to/llama-server/llama-server-latest
    --host 127.0.0.1 --port ${PORT}
    --flash-attn -ngl 999 -ngld 999
    --no-mmap

  "gemma3-args": |
      --model /path/to/models/gemma-3-27b-it-q4_0.gguf
      --temp 1.0
      --repeat-penalty 1.0
      --min-p 0.01
      --top-k 64
      --top-p 0.95

models:
  # fits on a single 24GB GPU w/ 100K context
  # requires Q4 KV quantization, ~22GB VRAM
  "gemma-single":
    cmd: |
      ${server-latest}
      ${gemma3-args}
      --cache-type-k q4_0 
      --cache-type-v q4_0
      --ctx-size 102400
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf

  # requires ~30GB VRAM
  "gemma":
    cmd: |
      ${server-latest}
      ${gemma3-args}
      --cache-type-k q8_0 
      --cache-type-v q8_0
      --ctx-size 102400
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf

  # draft model settings
  # --mmproj not compatible with draft models
  # ~32.5 GB VRAM @ 82K context 
  "gemma-draft":
    env:
      # 3090 - 38 tok/sec
      - "CUDA_VISIBLE_DEVICES=GPU-6f0,GPU-f10"
    cmd: |
      ${server-latest}
      ${gemma3-args}
      --cache-type-k q8_0 
      --cache-type-v q8_0
      --ctx-size 102400
      --model-draft /path/to/models/gemma-3-4b-it-q4_0.gguf
      --ctx-size-draft 102400
      --draft-max 8 --draft-min 4
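llama-swap exposes an OpenAI-compatible endpoint and launches whichever entry under `models:` matches the request's `model` field. A minimal sketch of selecting the single-GPU config — the listen address `localhost:8080` is an assumption; use whatever your llama-swap instance is configured with:

```python
import json
import urllib.request

# The "model" field selects the llama-swap config entry ("gemma-single",
# "gemma", or "gemma-draft"); llama-swap starts/stops llama-server as needed.
payload = {
    "model": "gemma-single",
    "messages": [{"role": "user", "content": "Write a haiku about VRAM."}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed llama-swap address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once llama-swap is running with the config above:
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```

The first request to a given model name triggers the swap, so expect a pause while llama-server loads the 27b weights before the first token arrives.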

Speculative Decoding Stats

Uses this testing script.

no draft model

| prompt | n | tok/sec | draft_n | draft_accepted | ratio |
| --- | --- | --- | --- | --- | --- |
| create a one page html snake game in javascript | 1600 | 38.71 | null | null | N/A |
| write a snake game in python | 1929 | 38.49 | null | null | N/A |
| write a story about a dog | 859 | 38.88 | null | null | N/A |

4b draft model

| prompt | n | tok/sec | draft_n | draft_accepted | ratio | Δ % |
| --- | --- | --- | --- | --- | --- | --- |
| create a one page html snake game in javascript | 1542 | 49.07 | 1422 | 956 | 0.67 | 26.7% |
| write a snake game in python | 1904 | 50.67 | 1709 | 1236 | 0.72 | 31.6% |
| write a story about a dog | 982 | 33.97 | 1068 | 282 | 0.26 | -14.4% |
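The `ratio` column is `draft_accepted / draft_n`, the fraction of drafted 4b tokens the 27b model accepted. Recomputing it from the table:

```python
# (draft_n, draft_accepted) per prompt, taken from the 4b draft model table.
rows = {
    "html snake game": (1422, 956),
    "python snake game": (1709, 1236),
    "story about a dog": (1068, 282),
}

for name, (draft_n, accepted) in rows.items():
    print(f"{name}: acceptance ratio {accepted / draft_n:.2f}")
```

The low 0.26 acceptance on the creative-writing prompt lines up with its tok/sec regression: rejected drafts waste verification work, which is why speculative decoding helps most on predictable, code-like output.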

Tip

Try path.to.sh.
