Docker Performance - amosproj/amos2025ss04-ai-driven-testing GitHub Wiki
Different Strategies for Running Multiple Docker Containers on a Single Machine
When working with multiple containerized large language models (LLMs), there are several ways to manage execution, depending on your performance, resource-usage, and latency needs. Below are three common strategies:
1. Sequential Execution (Start–Use–Stop Pattern)
Description
- Start a container
- Prompt the LLM and retrieve the response
- Stop (or remove) the container
- Repeat for each model
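The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual implementation: the helper names (`docker`, `run_start_use_stop`) and the `ask` callable that sends a prompt to a running model are hypothetical, and the only Docker commands assumed are `docker run -d --rm --name` and `docker stop`.

```python
import subprocess
from typing import Callable, Dict, Sequence


def docker(args: Sequence[str]) -> str:
    """Run `docker <args>` and return its stdout; raises if the command fails."""
    result = subprocess.run(
        ["docker", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def run_start_use_stop(
    models: Dict[str, str],            # container name -> image (placeholders)
    prompt: str,
    ask: Callable[[str, str], str],    # hypothetical: (container name, prompt) -> response
    cli: Callable[[Sequence[str]], str] = docker,
) -> Dict[str, str]:
    """Start one container, prompt it, stop it, then move on to the next model."""
    responses: Dict[str, str] = {}
    for name, image in models.items():
        # Start the container detached; --rm removes it once it is stopped.
        cli(["run", "-d", "--rm", "--name", name, image])
        try:
            # `ask` is assumed to wait until the model server is ready.
            responses[name] = ask(name, prompt)
        finally:
            # Always stop the container, even if prompting failed.
            cli(["stop", name])
    return responses
```

Because each container is stopped before the next one starts, at most one model is resident in memory at any time; the cost is one full startup per prompt.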
Pros
- Efficient use of system resources: Only one container uses CPU and memory at a time.
- Minimal idle overhead: No unused containers consume RAM or hold port bindings.
Cons
- Startup overhead: Starting and stopping containers adds latency for each prompt.
2. Sequential Prompting with All Containers Pre-Started
Description
- Start all LLM containers at the beginning of the session
- Prompt each model sequentially while they remain running
- Shut down all containers at the end
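A possible Python sketch of this pattern is shown below. It is an illustration under assumptions rather than the project's code: `docker`, `run_prestarted`, and the `ask` callable (which sends a prompt to a named running container) are hypothetical names, and only `docker run -d --rm --name` and `docker stop` are used from the Docker CLI.

```python
import subprocess
from typing import Callable, Dict, List, Sequence


def docker(args: Sequence[str]) -> str:
    """Run `docker <args>` and return its stdout; raises if the command fails."""
    result = subprocess.run(
        ["docker", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def run_prestarted(
    models: Dict[str, str],            # container name -> image (placeholders)
    prompt: str,
    ask: Callable[[str, str], str],    # hypothetical: (container name, prompt) -> response
    cli: Callable[[Sequence[str]], str] = docker,
) -> Dict[str, str]:
    """Start every container first, prompt them one by one, then stop them all."""
    started: List[str] = []
    responses: Dict[str, str] = {}
    try:
        # Phase 1: start all containers up front.
        for name, image in models.items():
            cli(["run", "-d", "--rm", "--name", name, image])
            started.append(name)
        # Phase 2: prompt sequentially while all containers stay running.
        for name in started:
            responses[name] = ask(name, prompt)
    finally:
        # Phase 3: stop whatever was started, even if an earlier step failed.
        for name in started:
            cli(["stop", name])
    return responses
```

Compared with the start-use-stop pattern, the startup cost is paid once at the beginning of the session instead of once per prompt.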
Pros
- Reduced latency: No container startup delay during prompting.
- Moderate resource control: Only one model is active at a time, but all are ready to respond.
Cons
- Idle resource usage: Containers consume some memory and CPU even when not actively used. This overhead should be negligible, although on a Windows machine, where the OS itself uses a lot of RAM by default, occasional lag was observed. It should not be a problem on clusters.
3. Parallel Execution (Using Threads or Async Routines)
Description
- For each model prompt:
  - Start a container in its own thread or async task
  - Prompt the model and retrieve the response
  - Shut down the container when finished
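One way to sketch the threaded variant in Python uses `concurrent.futures.ThreadPoolExecutor`. This is only an illustration under assumptions: the names `docker`, `start_use_stop`, and `run_parallel`, as well as the `ask` callable that prompts a running container, are hypothetical, and the `max_workers` cap is an added assumption for limiting how many models run at once.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional, Sequence


def docker(args: Sequence[str]) -> str:
    """Run `docker <args>` and return its stdout; raises if the command fails."""
    result = subprocess.run(
        ["docker", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def start_use_stop(name, image, prompt, ask, cli) -> str:
    """Lifecycle of a single container: start, prompt, stop."""
    cli(["run", "-d", "--rm", "--name", name, image])
    try:
        return ask(name, prompt)   # hypothetical prompt function
    finally:
        cli(["stop", name])        # stop even if prompting failed


def run_parallel(
    models: Dict[str, str],            # container name -> image (placeholders)
    prompt: str,
    ask: Callable[[str, str], str],
    cli: Callable[[Sequence[str]], str] = docker,
    max_workers: Optional[int] = None,  # cap on concurrently running containers
) -> Dict[str, str]:
    """Run each container's start-prompt-stop cycle in its own worker thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(start_use_stop, name, image, prompt, ask, cli)
            for name, image in models.items()
        }
        # result() blocks until the task finishes and re-raises any exception.
        return {name: future.result() for name, future in futures.items()}
```

Setting `max_workers` to a small number keeps only that many containers alive at once, which may help when RAM or GPU memory is limited.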
Pros
- Takes advantage of multiple CPU cores for concurrent execution.
Cons
- High system load: Running several models at once may exceed available resources (especially GPU or RAM).
- Inefficient: LLMs are computationally intensive. Starting and stopping them frequently adds a lot of unnecessary overhead, and no computation gets done while the system switches between them.
Recommendations
- Sequential with Pre-Started Containers is generally the most efficient approach for typical development environments, balancing performance and resource use.
- Sequential Start–Use–Stop is recommended when hardware constraints prevent keeping multiple containers loaded in memory.
- Parallel Execution is not ideal for consumer-grade machines due to its heavy resource demands. It may be appropriate in high-performance computing (HPC) environments, though in such cases, it is often more effective to use scheduling systems like Slurm to manage workloads.