gemma3 27b 100k context - mostlygeek/llama-swap GitHub Wiki
Warning
There is currently a bug with gemma3 at high context that produces <unused32> errors.
See this issue.
These configurations for gemma3 27b fit 100K context on:
- a single 24GB GPU with q4_0 KV quantization
- dual 24GB GPUs with q8_0 KV quantization (running without KV quantization currently appears to be buggy)
- examples were tested on a single 3090, a single P40, dual 3090s and dual P40s.
- Download both the model and the mmproj files from Google's Hugging Face repo: google/gemma-3-27b-it-qat-q4_0
- Use the latest llama-server. These configs were tested with version b5568.
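To see why the KV cache type decides whether 100K context fits, here is a rough back-of-the-envelope estimate in Python. It is a sketch, not a measurement: it assumes every layer caches the full context (models with sliding-window layers may need less), and the layer and head counts are hypothetical placeholders, not gemma3's confirmed architecture. The bytes-per-element values follow from the q4_0 and q8_0 block layouts (32 values plus a 2-byte scale per block).

```python
# Bytes per cached element for common llama.cpp KV cache types.
# q4_0: 16 bytes of 4-bit values + 2-byte scale = 18 bytes per 32 values
# q8_0: 32 bytes of 8-bit values + 2-byte scale = 34 bytes per 32 values
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_size, cache_type):
    """Estimate K+V cache size in GiB, assuming every layer caches the full context."""
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx_size  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, ctx_size=102400)

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(cache_type=cache_type, **shape)
    print(f"{cache_type}: {size:.2f} GiB")
```

Whatever the real model shape is, the ratios hold: q8_0 needs roughly half the cache memory of f16, and q4_0 roughly a quarter, which is the headroom the single-GPU configuration relies on.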
macros:
  "server-latest":
    /path/to/llama-server/llama-server-latest
    --host 127.0.0.1 --port ${PORT}
    --flash-attn -ngl 999 -ngld 999
    --no-mmap
  "gemma3-args": |
    --model /path/to/models/gemma-3-27b-it-q4_0.gguf
    --temp 1.0
    --repeat-penalty 1.0
    --min-p 0.01
    --top-k 64
    --top-p 0.95
models:
  # fits on a single 24GB GPU w/ 100K context
  # requires Q4 KV quantization, ~22GB VRAM
  "gemma-single":
    cmd: |
      ${server-latest}
      ${gemma3-args}
      --cache-type-k q4_0
      --cache-type-v q4_0
      --ctx-size 102400
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf
  # requires ~30GB VRAM
  "gemma":
    cmd: |
      ${server-latest}
      ${gemma3-args}
      --cache-type-k q8_0
      --cache-type-v q8_0
      --ctx-size 102400
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf
  # draft model settings
  # --mmproj not compatible with draft models
  # ~32.5 GB VRAM @ 82K context
  "gemma-draft":
    env:
      # 3090 - 38 tok/sec
      - "CUDA_VISIBLE_DEVICES=GPU-6f0,GPU-f10"
    cmd: |
      ${server-latest}
      ${gemma3-args}
      --cache-type-k q8_0
      --cache-type-v q8_0
      --ctx-size 102400
      --model-draft /path/to/models/gemma-3-4b-it-q4_0.gguf
      --ctx-size-draft 102400
      --draft-max 8 --draft-min 4
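The `${server-latest}` and `${gemma3-args}` references in the config above are macros that get expanded into each `cmd`. The Python sketch below illustrates the idea with plain string substitution. It is not llama-swap's actual implementation: the `expand_macros` helper and its `reserved` parameter are hypothetical, and `${PORT}` is assumed to be left in place for the launcher to fill in.

```python
import re

# Matches ${name}, where name may contain letters, digits, "_" and "-".
MACRO_PATTERN = re.compile(r"\$\{([A-Za-z0-9_-]+)\}")


def expand_macros(cmd, macros, reserved=("PORT",)):
    """Replace ${name} with macros[name]; leave reserved names untouched."""

    def substitute(match):
        name = match.group(1)
        if name in reserved:
            return match.group(0)  # keep ${PORT} for a later step
        if name not in macros:
            raise KeyError(f"unknown macro: {name}")
        return macros[name]

    return MACRO_PATTERN.sub(substitute, cmd)


# Shortened macro values, for illustration only.
macros = {
    "server-latest": "/path/to/llama-server/llama-server-latest --host 127.0.0.1 --port ${PORT}",
    "gemma3-args": "--model /path/to/models/gemma-3-27b-it-q4_0.gguf --temp 1.0",
}

cmd = "${server-latest} ${gemma3-args} --ctx-size 102400"
expanded = expand_macros(cmd, macros)
print(expanded)
```

Keeping the shared server flags and sampling settings in macros means each model entry only has to state what differs: cache type, context size, and the mmproj or draft model.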
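Benchmarks were run with this testing script. The first table shows results without a draft model (the draft columns are null); the second shows results with the draft model, where Δ % is the change in tok/sec.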
prompt | n | tok/sec | draft_n | draft_accepted | ratio |
---|---|---|---|---|---|
create a one page html snake game in javascript | 1600 | 38.71 | null | null | N/A |
write a snake game in python | 1929 | 38.49 | null | null | N/A |
write a story about a dog | 859 | 38.88 | null | null | N/A |
prompt | n | tok/sec | draft_n | draft_accepted | ratio | Δ % |
---|---|---|---|---|---|---|
create a one page html snake game in javascript | 1542 | 49.07 | 1422 | 956 | 0.67 | 26.7% |
write a snake game in python | 1904 | 50.67 | 1709 | 1236 | 0.72 | 31.6% |
write a story about a dog | 982 | 33.97 | 1068 | 282 | 0.26 | -14.4% |
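The derived columns in the tables above can be recomputed from the raw numbers. The Python sketch below assumes `ratio` is `draft_accepted / draft_n` and that Δ % is the tok/sec change relative to the run without a draft model. The function names are illustrative and are not taken from the testing script.

```python
def acceptance_ratio(draft_n, draft_accepted):
    """Fraction of drafted tokens that the main model accepted."""
    return draft_accepted / draft_n


def percent_change(baseline, value):
    """Change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100


# (prompt, tok/sec without draft, tok/sec with draft, draft_n, draft_accepted)
rows = [
    ("html snake game", 38.71, 49.07, 1422, 956),
    ("python snake game", 38.49, 50.67, 1709, 1236),
    ("dog story", 38.88, 33.97, 1068, 282),
]

for prompt, base_tps, draft_tps, draft_n, draft_accepted in rows:
    ratio = acceptance_ratio(draft_n, draft_accepted)
    delta = percent_change(base_tps, draft_tps)
    print(f"{prompt}: ratio={ratio:.2f}, change={delta:+.1f}%")
```

The two code prompts have high acceptance ratios and speed up, while the story prompt's low ratio makes drafting a net loss. With the no-draft run as the baseline, the dog-story row comes out near -12.6%. The table's -14.4% appears to use the draft run's tok/sec as the denominator instead.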