ollama:option - chunhualiao/public-docs GitHub Wiki

ollama

see also LiteLLM:option

When using models with Ollama (via the CLI command ollama run, the REST API endpoints like /api/generate or /api/chat, or client libraries), you can configure inference hyperparameters through two main mechanisms:

  1. Default/fixed settings — Baked into the model when you create it (via a Modelfile using ollama create).
  2. Runtime overrides — Passed per-request (e.g., temporarily overriding temperature for a single query without changing the model).

Both use the same set of options, derived from llama.cpp (the backend Ollama uses).
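The precedence is straightforward: per-request options override Modelfile `PARAMETER` defaults, which in turn override Ollama's built-in defaults. A minimal Python sketch of that merge order (illustrative only, not Ollama's actual code):

```python
# Illustrative only: how option precedence resolves, lowest priority first.
BUILTIN_DEFAULTS = {"temperature": 0.8, "num_ctx": 2048, "top_p": 0.9}

def resolve_options(modelfile_params: dict, request_options: dict) -> dict:
    """Request options win over Modelfile PARAMETERs, which win over built-ins."""
    merged = dict(BUILTIN_DEFAULTS)
    merged.update(modelfile_params)   # baked in via `ollama create`
    merged.update(request_options)    # per-request "options" field
    return merged

# temperature comes from the Modelfile, num_ctx from the request:
print(resolve_options({"temperature": 0.7}, {"num_ctx": 8192}))
```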

Full List of Available Hyperparameters (Options)

These are the currently supported options (as of late 2025; Ollama tracks llama.cpp updates closely):

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `num_ctx` | int | 2048 or model-specific | Context window size in tokens (how much history the model "sees"). Larger values use more VRAM/RAM. |
| `temperature` | float | 0.8 | Randomness/creativity (0 = deterministic, higher = more creative). |
| `top_k` | int | 40 | Limits sampling to the top K most probable tokens. |
| `top_p` | float | 0.9 | Nucleus sampling: considers the smallest set of tokens whose cumulative probability exceeds p. |
| `min_p` | float | 0.0 | Minimum probability threshold relative to the most likely token (newer addition for diversity). |
| `repeat_penalty` | float | 1.1 | Penalizes repetition of recent tokens. |
| `repeat_last_n` | int | 64 | How many recent tokens to consider for `repeat_penalty` (-1 = entire context). |
| `num_predict` | int | -1 (unlimited) / -2 (fill context) | Maximum tokens to generate (-1 means "keep going until stop token or context full", but Ollama caps it internally for safety). |
| `num_keep` | int | 0 | Number of initial prompt tokens to always keep (useful for forced prefixes). |
| `seed` | int | random | Random seed for reproducible outputs (use the same seed plus temperature 0 for fully deterministic output). |
| `stop` | string/array | model-specific | Stop sequences (the model stops generating when it hits one). |
| `tfs_z` | float | 1.0 | Tail Free Sampling z: reduces the probability of low-probability tokens. |
| `typical_p` | float | 1.0 | Local typicality sampling. |
| `presence_penalty` | float | 0.0 | Penalizes tokens that already appeared anywhere in the context. |
| `frequency_penalty` | float | 0.0 | Penalizes tokens based on how frequently they appeared in the context. |
| `mirostat` | int | 0 (disabled) | Mirostat sampling mode (0 = disabled, 1 = original, 2 = v2). |
| `mirostat_tau` | float | 5.0 | Target surprise/perplexity for Mirostat. |
| `mirostat_eta` | float | 0.1 | Learning rate for Mirostat. |
| `num_thread` | int | auto | Number of CPU threads to use (usually best left on auto). |
| `num_gpu` | int | auto | Number of GPU layers to offload (-1 = as many as possible). |
| `main_gpu` | int | 0 | Primary GPU index when using multiple GPUs. |
| `low_vram` | bool | false | Enable techniques to reduce VRAM usage (slower). |
| `num_batch` | int | 512 | Batch size for prompt processing. |
| `logit_bias` | map | none | Token ID → bias value to adjust the likelihood of specific tokens. |

There are also some less-common/advanced options like grammar, json_mode, etc., that appear in newer Ollama/llama.cpp versions.
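To make the sampling parameters concrete, here is a toy Python implementation of `top_k` filtering followed by `top_p` (nucleus) filtering over a small probability distribution. This is a sketch of the idea only, not llama.cpp's actual sampler:

```python
def filter_top_k_top_p(probs: dict, top_k: int = 40, top_p: float = 0.9) -> dict:
    """Keep the top_k most probable tokens, then the smallest prefix whose
    cumulative probability reaches top_p; renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:      # nucleus cutoff reached
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
# "zzz" is cut by top_k=3; the rest are renormalized
print(filter_top_k_top_p(dist, top_k=3, top_p=0.9))
```

Lower `top_p` or `top_k` narrows the candidate set (more focused output); `temperature` would then reshape the surviving probabilities before a token is drawn.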

How to Set Them

1. In a Modelfile (permanent defaults)

```
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|eot_id|>"
```

Then run `ollama create my-llama3 -f Modelfile`.

The model `my-llama3` will now always use those defaults.
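If you create model variants in a script, a small helper can render a Modelfile from a dict of parameters (a hypothetical convenience function, not part of any Ollama library):

```python
def make_modelfile(base: str, params: dict) -> str:
    """Render a Modelfile string; string values like stop sequences are quoted."""
    lines = [f"FROM {base}"]
    for name, value in params.items():
        if isinstance(value, str):
            lines.append(f'PARAMETER {name} "{value}"')
        else:
            lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"

print(make_modelfile("llama3.2", {"temperature": 0.7,
                                  "num_ctx": 8192,
                                  "stop": "<|eot_id|>"}))
```

Write the result to a file and run `ollama create my-llama3 -f Modelfile` as above.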

2. At runtime via CLI

```bash
# Start an interactive session, then adjust parameters with /set
ollama run llama3
>>> /set parameter num_ctx 16384
>>> /set parameter temperature 0.2
>>> Why is the sky blue?
```

3. Via the REST API (most common for apps)

POST this body to `/api/generate`:

```json
{
  "model": "llama3",
  "prompt": "Explain quantum computing",
  "options": {
    "temperature": 0.2,
    "num_ctx": 8192,
    "top_p": 0.95,
    "num_predict": 512
  }
}
```
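The same request can be made from Python with only the standard library (this sketch assumes a local server on Ollama's default port 11434 and disables streaming so a single JSON object comes back):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local server

def build_payload(model: str, prompt: str, options: dict) -> bytes:
    """Encode a non-streaming /api/generate request body."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object instead of a token stream
        "options": options,
    }).encode("utf-8")

def generate(model: str, prompt: str, options: dict) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt, options),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama server:
# print(generate("llama3", "Explain quantum computing", {"temperature": 0.2}))
```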

4. In client libraries (Python, JS, etc.)

```python
from ollama import chat

response = chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Hello'}],
    options={
        'temperature': 0.1,
        'num_ctx': 16384,
    },
)
print(response['message']['content'])
```

These options give you full control over sampling, context length, repetition handling, reproducibility, and hardware usage for any model run through Ollama. Exact defaults vary slightly per base model (e.g., Llama 3.2 supports up to 128k context, but Ollama often ships a lower default such as 4k or 8k to save memory), and you can always override them at runtime. For the very latest list, check the official Ollama GitHub repo (docs/modelfile.md and docs/api.md) or run `ollama show --modelfile <model>` to see what a particular model uses.
