ollama:option - chunhualiao/public-docs GitHub Wiki
see also LiteLLM:option
When using models with Ollama (via the CLI command `ollama run`, the REST API endpoints such as `/api/generate` or `/api/chat`, or the client libraries), you can configure inference hyperparameters through two main mechanisms:
- **Default/fixed settings**: baked into the model when you create it (via a Modelfile using `ollama create`).
- **Runtime overrides**: passed per-request (e.g., temporarily overriding `temperature` for a single query without changing the model).
Both mechanisms accept the same set of options, derived from llama.cpp (the backend Ollama uses).
These are the currently supported options (as of late 2025; Ollama tracks llama.cpp updates closely):
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_ctx` | int | 2048 or model-specific | Context window size in tokens (how much history the model "sees"). Larger values use more VRAM/RAM. |
| `temperature` | float | 0.8 | Randomness/creativity (0 = deterministic, higher = more creative). |
| `top_k` | int | 40 | Limits sampling to the top K most probable tokens. |
| `top_p` | float | 0.9 | Nucleus sampling: considers the smallest set of tokens whose cumulative probability exceeds p. |
| `min_p` | float | 0.0 | Minimum probability threshold relative to the most likely token (a newer addition for diversity). |
| `repeat_penalty` | float | 1.1 | Penalizes repetition of recent tokens. |
| `repeat_last_n` | int | 64 | How many recent tokens to consider for `repeat_penalty` (-1 = entire context). |
| `num_predict` | int | -1 (unlimited) / -2 (fill context) | Maximum tokens to generate (-1 means "keep going until a stop token or the context is full", but Ollama caps it internally for safety). |
| `num_keep` | int | 0 | Number of initial prompt tokens to always keep (useful for forced prefixes). |
| `seed` | int | random | Random seed for reproducible outputs (use the same seed plus temperature 0 for fully deterministic output). |
| `stop` | string/array | model-specific | Stop sequences (the model stops generating when it hits one). |
| `tfs_z` | float | 1.0 | Tail Free Sampling zeta: reduces the probability of low-probability tokens. |
| `typical_p` | float | 1.0 | Local typicality sampling. |
| `presence_penalty` | float | 0.0 | Penalizes tokens that already appeared anywhere in the context. |
| `frequency_penalty` | float | 0.0 | Penalizes tokens based on how frequently they appeared in the context. |
| `mirostat` | int | 0 (disabled) | Mirostat sampling mode (0 = disabled, 1 = original, 2 = v2). |
| `mirostat_tau` | float | 5.0 | Target surprise/perplexity for Mirostat. |
| `mirostat_eta` | float | 0.1 | Learning rate for Mirostat. |
| `num_thread` | int | auto | Number of CPU threads to use (usually best left on auto). |
| `num_gpu` | int | auto | Number of GPU layers to offload (-1 = as many as possible). |
| `main_gpu` | int | 0 | Primary GPU index when using multiple GPUs. |
| `low_vram` | bool | false | Enable techniques to reduce VRAM usage (slower). |
| `num_batch` | int | 512 | Batch size for prompt processing. |
| `logit_bias` | map | none | Token ID → bias value to adjust the likelihood of specific tokens. |
There are also some less-common/advanced options, such as `grammar` and JSON mode, that appear in newer Ollama/llama.cpp versions.
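To make the two most commonly tuned sampling parameters concrete, here is a small pure-Python sketch of how `top_k` and `top_p` (nucleus) filtering restrict the candidate token set before sampling. This is an illustrative re-implementation, not the actual llama.cpp code, and the exact cutoff semantics there differ in minor details:

```python
def top_k_top_p_filter(probs, top_k=40, top_p=0.9):
    """Illustrative top_k / top_p (nucleus) filtering.

    probs: dict mapping token -> probability (summing to ~1).
    Returns the renormalized distribution that would actually be sampled from.
    """
    # Sort tokens by probability, most likely first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # top_k: keep only the K most probable tokens.
    ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1 again.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}
```

For example, with the defaults (`top_k=40`, `top_p=0.9`) applied to `{'a': 0.5, 'b': 0.3, 'c': 0.15, 'd': 0.05}`, the low-probability token `d` is dropped, since `a`, `b`, and `c` already cover 95% of the mass.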
**1. In a Modelfile (permanent defaults)**

```
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|eot_id|>"
```

Then run `ollama create my-llama3 -f Modelfile`. The model `my-llama3` will now always use those defaults.
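Since a Modelfile is plain text, it is also easy to generate one programmatically. A minimal sketch (the `render_modelfile` helper below is hypothetical, not part of any Ollama package):

```python
def render_modelfile(base_model, params, stop_sequences=()):
    """Build Modelfile text like the example above.

    base_model:     name passed to FROM (e.g. "llama3.2")
    params:         dict of PARAMETER name -> value
    stop_sequences: stop strings, each emitted as its own PARAMETER stop line
    """
    lines = [f"FROM {base_model}"]
    for name, value in params.items():
        lines.append(f"PARAMETER {name} {value}")
    for stop in stop_sequences:
        lines.append(f'PARAMETER stop "{stop}"')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    {"temperature": 0.7, "num_ctx": 8192},
    stop_sequences=["<|eot_id|>"],
)
# Write `text` to a file named Modelfile, then:
#   ollama create my-llama3 -f Modelfile
```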
**2. At runtime via CLI**

```bash
ollama run llama3
# then, inside the interactive session, override parameters with /set:
# >>> /set parameter num_ctx 8192
# >>> /set parameter temperature 0.7
```
**3. Via the REST API (most common for apps)**

Pass an `options` object in the request body, e.g. with `POST /api/generate`:

```json
{
  "model": "llama3",
  "prompt": "Explain quantum computing",
  "options": {
    "temperature": 0.2,
    "num_ctx": 8192,
    "top_p": 0.95,
    "num_predict": 512
  }
}
```

**4. In client libraries (Python, JS, etc.)**
```python
from ollama import chat

chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Hello'}],
    options={
        'temperature': 0.1,
        'num_ctx': 16384,
    },
)
```

These options give you full control over sampling, context length, repetition handling, reproducibility, and hardware usage when running any model through Ollama. The exact defaults can vary slightly per base model (e.g., Llama 3.2 defaults to a 128k context if the quantization supports it, but Ollama often ships with a lower default like 8k or 4k to save memory), and you can always override them at runtime. For the very latest list, check the official Ollama GitHub repo (`docs/modelfile.md` and `docs/api.md`) or run `ollama show --modelfile <model>` to see what a particular model is using.
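The JSON payload from section 3 can also be assembled and sent with nothing but the Python standard library. A sketch, assuming an Ollama server on the default `localhost:11434` (the network call is left commented out so the snippet stands alone; `build_generate_payload` is a hypothetical helper):

```python
import json
import urllib.request


def build_generate_payload(model, prompt, **options):
    """Assemble a /api/generate request body; any keyword argument becomes
    an entry in the "options" object (e.g. temperature, num_ctx, seed)."""
    return {"model": model, "prompt": prompt, "options": options}


# Fixed seed + temperature 0 for reproducible output (see the seed row above).
payload = build_generate_payload(
    "llama3",
    "Explain quantum computing",
    temperature=0,
    seed=42,
    num_predict=512,
)
body = json.dumps(payload).encode("utf-8")

# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     for line in resp:  # /api/generate streams one JSON object per line
#         print(json.loads(line).get("response", ""), end="")
```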