# Possible Ways to run Ollama and Settings
Ollama Integration – Project Notes
## Running Ollama in Docker vs Natively
| Aspect | Docker | Native |
|---|---|---|
| GPU Access | Requires explicit GPU access via the NVIDIA Container Toolkit (`nvidia-container-toolkit`) | Direct access |
| Model Storage | A volume must be mounted to persist models | Stored directly in the host filesystem |
| Port Mapping | Must be configured explicitly | Handled automatically by the host |
| Environment Consistency | Fully containerized and reproducible | Depends on correct host configuration |
| Model Isolation | Each model can run in its own container | All models coexist |
| Performance | Slightly slower due to container overhead | Slightly faster |
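
As a sketch of how the Docker-side settings above translate into an actual invocation (GPU access, persistent model storage, and port mapping), a typical `docker run` might look like the following; the container and volume names are placeholders, and `--gpus=all` assumes the NVIDIA Container Toolkit is installed on the host:

```bash
# Run Ollama in Docker with GPU access (requires nvidia-container-toolkit),
# a named volume so pulled models persist across container restarts,
# and the default Ollama API port mapped to the host.
docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```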
## Ollama Settings and Capabilities
### Modelfile
- Defines and builds a configured model.
- Combines a base model and custom settings into a new model tag, e.g. `llama3:8b` + custom settings → `llama3:mycustomllm` (see the sketch below).
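
A minimal Modelfile sketch for this, assuming `llama3:8b` has already been pulled; the parameter values and the tag `llama3:mycustomllm` are illustrative only:

```
# Modelfile: base model plus custom settings
FROM llama3:8b

# Illustrative parameter overrides
PARAMETER temperature 0
PARAMETER num_ctx 4096
```

The tagged model would then be built with `ollama create llama3:mycustomllm -f Modelfile`.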
### API Endpoints
#### POST /api/generate
- One-time prompt execution.
- Key request parameters (see the sketch after this list):
  - `format`: forces the output format (defined via a JSON schema); useful for classification, but probably less so for code responses.
  - `options.seed`: enables deterministic responses for reproducibility.
  - `options.num_ctx`: defines the context window size (token memory). A larger window can improve output, but also means more computation.
  - `options.temperature`: controls response creativity (`0` is deterministic; higher values increase variation).
  - `options.num_predict`: sets the maximum number of output tokens.
- Sending a request with an empty prompt preloads the model without generating output.
- Some LLMs can handle images; Ollama generally supports this.
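
A minimal sketch of a `/api/generate` call using Python and the `requests` library; the host and port assume Ollama's default endpoint at `localhost:11434`, and the model tag, prompt, and option values are illustrative:

```python
import requests

# One-shot generation request against a local Ollama instance
# (default endpoint http://localhost:11434 assumed).
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",  # illustrative model tag
        "prompt": "Write a unit test for a function that adds two numbers.",
        "stream": False,       # return a single JSON object instead of a stream
        "options": {
            "seed": 42,          # reproducible output
            "temperature": 0,    # deterministic, no creative variation
            "num_ctx": 4096,     # context window size in tokens
            "num_predict": 256,  # cap on generated tokens
        },
    },
    timeout=300,
)
print(response.json()["response"])
```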
#### POST /api/chat
- Similar to `generate`, but supports multi-turn interaction.
- Accepts a `messages` array representing the prior dialogue (see the sketch after this list).
- Can be configured with callable tools (e.g., functions).
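
A minimal sketch of a multi-turn `/api/chat` request, again assuming the default local endpoint; the model tag and message contents are illustrative:

```python
import requests

# Multi-turn chat request: prior dialogue is passed in the "messages" array,
# and the model answers the last user message in that context.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3:8b",  # illustrative model tag
        "messages": [
            {"role": "user", "content": "Generate a pytest test for an add() helper."},
            {"role": "assistant", "content": "def test_add():\n    assert add(1, 2) == 3"},
            {"role": "user", "content": "Now add an edge case for negative numbers."},
        ],
        "stream": False,
    },
    timeout=300,
)
print(response.json()["message"]["content"])
```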
#### POST /api/show
- Returns metadata for a given model (e.g., its Modelfile, parameters, template, and details), as sketched below.
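
A minimal sketch of querying model metadata via `/api/show`; the same default-endpoint assumption applies, and the `model` request field and illustrative tag are placeholders:

```python
import requests

# Metadata query for a model known to the local Ollama instance.
response = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "llama3:8b"},  # illustrative model tag
    timeout=30,
)
info = response.json()
print(info.get("details"))     # e.g. model family, parameter size, quantization level
print(info.get("parameters"))  # parameters configured for the model
```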
## Performance and Scaling Considerations
- Quantization: simple to apply and may improve performance slightly (essentially trades a small amount of response quality for faster runtime).
- Concurrency: Ollama supports multiple parallel requests within a single container. Behavior can be adjusted via environment variables (see the sketch after this list):
  - `OLLAMA_NUM_PARALLEL`
  - `OLLAMA_MAX_LOADED_MODELS`
  - `OLLAMA_MAX_QUEUE`
- Spawning a separate container per request probably offers simpler scaling and isolation.
- Built-in performance benchmarks are only available when running Ollama natively (not via Docker).
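
A sketch of how these variables could be set, either for a native `ollama serve` or passed into the Docker container from the earlier example; the specific values are illustrative:

```bash
# Native: configure concurrency before starting the server (illustrative values).
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_MAX_QUEUE=128 ollama serve

# Docker: pass the same variables into the container.
docker run -d \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -e OLLAMA_MAX_QUEUE=128 \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```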
## Additional Resources
- Example usage: https://github.com/ollama/ollama-python/tree/main/examples
- Official Ollama README: https://github.com/ollama/ollama/blob/main/README.md
- Official Ollama docs: https://github.com/ollama/ollama/blob/main/docs/README.md