Possible Ways to run Ollama and Settings

Ollama Integration – Project Notes

Running Ollama in Docker vs Natively

| Aspect | Docker | Native |
|---|---|---|
| GPU Access | Requires explicit allowance via nvidia-container-toolkit | Direct access |
| Model Storage | Must mount a volume to persist models | Stored directly in the filesystem |
| Port Mapping | Must be configured explicitly | Handled automatically by the host |
| Environment Consistency | Fully containerized and reproducible | Dependent on correct host configuration |
| Model Isolation | Each model can run in its own container | All models coexist |
| Performance | Slightly slower due to container overhead | Slightly faster |
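
For reference, a minimal sketch of starting Ollama in Docker that covers the GPU, storage, and port rows above (image tag and volume name follow the common Ollama Docker setup; adjust as needed):

```bash
# GPU access via --gpus=all requires nvidia-container-toolkit on the host.
# The named volume "ollama" persists pulled models across container restarts.
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```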

Ollama Settings and Capabilities

Modelfile

  • Defines and builds a configured model.
  • Combines a base model and settings into a custom tag:
    Example: llama3:8b + custom settings → llama3:mycustomllm
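
For illustration, a minimal Modelfile sketch that would produce the llama3:mycustomllm tag above (the parameter values are assumptions, not recommendations):

```
# Modelfile: base model plus custom settings
FROM llama3:8b
PARAMETER temperature 0
PARAMETER num_ctx 8192
SYSTEM You are an assistant that generates unit tests and returns only code.
```

Building it registers the new tag locally: `ollama create llama3:mycustomllm -f Modelfile`.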

API Endpoints

POST /api/generate
  • One-time prompt execution.
  • Key request parameters:
    • format: Constrains the output to a given format (e.g., a JSON schema); useful for classification, but probably less so for code responses.
    • options.seed: Enables deterministic responses for reproducibility.
    • options.num_ctx: Sets the context window size (token memory). A larger window can improve output quality, but also increases computation and memory use.
    • options.temperature: Controls response creativity (0 is deterministic, higher values increase variation).
    • options.num_predict: Sets maximum number of output tokens.
  • Sending a request without a prompt preloads the model into memory without generating output.
  • Some models can process images; Ollama generally supports passing them as base64 strings in the images field.
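
A minimal sketch of a one-shot /api/generate call using the parameters above (assumes an Ollama server on localhost:11434 and a pulled llama3:8b model):

```python
import requests

# One-shot generation with deterministic settings.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",
        "prompt": "Write a pytest test for a function add(a, b).",
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "seed": 42,          # reproducible output
            "temperature": 0,    # deterministic response
            "num_ctx": 8192,     # context window size in tokens
            "num_predict": 512,  # cap on generated tokens
        },
    },
    timeout=300,
)
print(resp.json()["response"])
```
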
POST /api/chat
  • Similar to generate, but supports multi-turn interaction.

  • Accepts a messages array representing the prior dialogue; each entry has a role (system, user, or assistant) and content (see the sketch after this list).

  • Can be configured with callable tools (e.g., functions).
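
A minimal sketch of a multi-turn /api/chat call, where the messages array carries the prior dialogue (assumes an Ollama server on localhost:11434 and a pulled llama3:8b model):

```python
import requests

# Multi-turn chat: earlier turns are passed back in the messages array.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3:8b",
        "stream": False,
        "messages": [
            {"role": "system", "content": "You are a concise testing assistant."},
            {"role": "user", "content": "Write a pytest test for add(a, b)."},
            {"role": "assistant", "content": "def test_add():\n    assert add(1, 2) == 3"},
            {"role": "user", "content": "Now also cover negative numbers."},
        ],
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```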

POST /api/show
  • Returns metadata (e.g., Modelfile, parameters, template, and model details) for a given model.
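
A short sketch of inspecting a model via /api/show (the model name is an assumption):

```python
import requests

# Fetch metadata for a named model via /api/show.
info = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "llama3:8b"},  # model to inspect
    timeout=60,
).json()
print(info.get("details"))     # family, parameter size, quantization level, ...
print(info.get("parameters"))  # parameters set via the Modelfile, if any
```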

Performance and Scaling Considerations

  • Quantization: Simple to apply (typically by pulling a quantized model tag); it trades a small amount of response quality for faster runtime and lower memory use.
  • Concurrency: Ollama supports multiple parallel requests within a single container.
    Environment variables to adjust this behavior (see the sketch after this list):
    • OLLAMA_NUM_PARALLEL
    • OLLAMA_MAX_LOADED_MODELS
    • OLLAMA_MAX_QUEUE
  • Spawning separate containers per request probably offers simpler scaling and isolation.
  • Built-in performance benchmarks are only available when running Ollama natively (not via Docker).
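
A sketch of setting the concurrency variables when starting the container; the values are illustrative assumptions, not recommendations:

```bash
# Illustrative values; tune to the available hardware.
docker run -d --gpus=all \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -e OLLAMA_MAX_QUEUE=128 \
  -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

For a native install, the same variables can be exported in the environment before starting `ollama serve`.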

Additional Resources