Possible Ways to run Ollama and Settings

Ollama Integration – Project Notes

Running Ollama in Docker vs Natively

| Aspect | Docker | Native |
|---|---|---|
| GPU Access | Requires explicit allowance via nvidia-container-toolkit | Direct access |
| Model Storage | Must mount a volume to persist models | Stored directly in the filesystem |
| Port Mapping | Must be configured explicitly | Handled automatically by the host |
| Environment Consistency | Fully containerized and reproducible | Dependent on correct host configuration |
| Model Isolation | Each model can run in its own container | All models coexist |
| Performance | Slightly slower due to container overhead | Slightly faster |
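
For reference, a minimal sketch of starting Ollama in Docker that covers the GPU, storage, and port rows above (image tag and volume name follow the common Ollama Docker setup; adjust as needed):

```bash
# GPU access via --gpus=all requires nvidia-container-toolkit on the host.
# The named volume "ollama" persists pulled models across container restarts.
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```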

Ollama Settings and Capabilities

Modelfile

  • Defines and builds a configured model.
  • Combines a base model and settings into a custom tag:
    Example: llama3:8b + custom settings → llama3:mycustomllm
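
For illustration, a minimal Modelfile sketch that would produce the llama3:mycustomllm tag above (the parameter values are assumptions, not recommendations):

```
# Modelfile: base model plus custom settings
FROM llama3:8b
PARAMETER temperature 0
PARAMETER num_ctx 8192
SYSTEM You are an assistant that generates unit tests and returns only code.
```

Building it registers the new tag locally: `ollama create llama3:mycustomllm -f Modelfile`.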

API Endpoints

POST /api/generate
  • One-time prompt execution.
  • Key request parameters:
    • format: Constrains the output to a given format (e.g., a JSON schema); useful for classification, but probably less so for code responses.
    • options.seed: Enables deterministic responses for reproducibility.
    • options.num_ctx: Sets the context window size (token memory). A larger window can improve output quality, but also increases computation and memory use.
    • options.temperature: Controls response creativity (0 is deterministic, higher values increase variation).
    • options.num_predict: Sets maximum number of output tokens.
  • Sending a request without a prompt preloads the model into memory without generating output.
  • Some models can process images; Ollama generally supports passing them as base64 strings in the images field.
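
A minimal sketch of a one-shot /api/generate call using the parameters above (assumes an Ollama server on localhost:11434 and a pulled llama3:8b model):

```python
import requests

# One-shot generation with deterministic settings.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",
        "prompt": "Write a pytest test for a function add(a, b).",
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "seed": 42,          # reproducible output
            "temperature": 0,    # deterministic response
            "num_ctx": 8192,     # context window size in tokens
            "num_predict": 512,  # cap on generated tokens
        },
    },
    timeout=300,
)
print(resp.json()["response"])
```
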
POST /api/chat
  • Similar to generate, but supports multi-turn interaction.

  • Accepts a messages array representing the prior dialogue; each entry has a role (system, user, or assistant) and content (see the sketch after this list).

  • Can be configured with callable tools (e.g., functions).
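
A minimal sketch of a multi-turn /api/chat call, where the messages array carries the prior dialogue (assumes an Ollama server on localhost:11434 and a pulled llama3:8b model):

```python
import requests

# Multi-turn chat: earlier turns are passed back in the messages array.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3:8b",
        "stream": False,
        "messages": [
            {"role": "system", "content": "You are a concise testing assistant."},
            {"role": "user", "content": "Write a pytest test for add(a, b)."},
            {"role": "assistant", "content": "def test_add():\n    assert add(1, 2) == 3"},
            {"role": "user", "content": "Now also cover negative numbers."},
        ],
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```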

POST /api/show
  • Returns metadata (e.g., Modelfile, parameters, template, and model details) for a given model.
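
A short sketch of inspecting a model via /api/show (the model name is an assumption):

```python
import requests

# Fetch metadata for a named model via /api/show.
info = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "llama3:8b"},  # model to inspect
    timeout=60,
).json()
print(info.get("details"))     # family, parameter size, quantization level, ...
print(info.get("parameters"))  # parameters set via the Modelfile, if any
```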

Performance and Scaling Considerations

  • Quantization: Simple to apply (typically by pulling a quantized model tag); it trades a small amount of response quality for faster runtime and lower memory use.
  • Concurrency: Ollama supports multiple parallel requests within a single container.
    Environment variables to adjust this behavior (see the sketch after this list):
    • OLLAMA_NUM_PARALLEL
    • OLLAMA_MAX_LOADED_MODELS
    • OLLAMA_MAX_QUEUE
  • Spawning separate containers per request probably offers simpler scaling and isolation.
  • Built-in performance benchmarks are only available when running Ollama natively (not via Docker).
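
A sketch of setting the concurrency variables when starting the container; the values are illustrative assumptions, not recommendations:

```bash
# Illustrative values; tune to the available hardware.
docker run -d --gpus=all \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -e OLLAMA_MAX_QUEUE=128 \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

For a native install, the same variables can be exported in the environment before starting `ollama serve`.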

Additional Resources