Docker Performance - amosproj/amos2025ss04-ai-driven-testing GitHub Wiki

Different Strategies for Running Multiple Docker Containers on a Single Machine

When working with multiple containerized large language models (LLMs), there are several ways to manage execution, depending on your performance, resource, and latency requirements. Below are three common strategies:

1. Sequential Execution (Start–Use–Stop Pattern)

Description

Start a container
Prompt the LLM and retrieve the response
Stop (or remove) the container
Repeat for each model
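The loop above can be sketched in Python as follows. This is a minimal illustration, not the project's actual implementation: it assumes the containers already exist and can be addressed by name with `docker start` and `docker stop`, and the `ask` callable is a hypothetical placeholder for whatever client sends the prompt to the model and waits until the model is ready to answer.

```python
import subprocess
from typing import Callable, Dict, List


def docker_start(name: str) -> None:
    """Start an existing container by name via the Docker CLI."""
    subprocess.run(["docker", "start", name], check=True)


def docker_stop(name: str) -> None:
    """Stop a running container by name via the Docker CLI."""
    subprocess.run(["docker", "stop", name], check=True)


def run_start_use_stop(
    containers: List[str],
    prompt: str,
    ask: Callable[[str, str], str],  # hypothetical: (container, prompt) -> response
    start: Callable[[str], None] = docker_start,
    stop: Callable[[str], None] = docker_stop,
) -> Dict[str, str]:
    """Start, prompt, and stop each container in turn."""
    responses: Dict[str, str] = {}
    for name in containers:
        start(name)
        try:
            responses[name] = ask(name, prompt)
        finally:
            # Always free the container's resources, even if prompting fails.
            stop(name)
    return responses
```

Passing `start` and `stop` as parameters keeps the loop testable without a Docker daemon; in real use the defaults shell out to the Docker CLI.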

Pros

Efficient use of system resources: Only one container uses CPU and memory at a time.
Minimal idle overhead: No unused containers consuming RAM or port bindings.

Cons

Startup overhead: Starting and stopping containers adds latency for each prompt.
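To put a number on this overhead, the three phases of one cycle can be timed separately. The sketch below is only an illustration: `start`, `ask`, and `stop` are hypothetical callables supplied by the caller (for example, thin wrappers around `docker start`, an HTTP request to the model, and `docker stop`), and the function names are made up for this example.

```python
import time
from typing import Callable, Dict


def measure_phases(
    name: str,
    prompt: str,
    start: Callable[[str], None],
    ask: Callable[[str, str], str],
    stop: Callable[[str], None],
) -> Dict[str, float]:
    """Return wall-clock seconds spent starting, prompting, and stopping."""
    t0 = time.perf_counter()
    start(name)
    t1 = time.perf_counter()
    ask(name, prompt)
    t2 = time.perf_counter()
    stop(name)
    t3 = time.perf_counter()
    return {"start": t1 - t0, "prompt": t2 - t1, "stop": t3 - t2}


def overhead_share(timings: Dict[str, float]) -> float:
    """Fraction of the total time spent on container start and stop."""
    total = sum(timings.values())
    overhead = timings["start"] + timings["stop"]
    return overhead / total if total > 0 else 0.0
```

A share close to 1 would indicate that most of each cycle goes into container management rather than inference.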

Performance Snapshots:

2. Sequential Prompting with All Containers Pre-Started

Description

Start all LLM containers at the beginning of the session
Prompt each model sequentially while they remain running
Shut down all containers at the end
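A minimal sketch of this pattern, under the assumption that the caller provides three hypothetical callables: `start` and `stop` (for example, wrappers around `docker start` and `docker stop`) and `ask`, which sends a prompt to a running model and returns its response. None of these names come from the project itself.

```python
from typing import Callable, Dict, List


def run_prestarted(
    containers: List[str],
    prompt: str,
    start: Callable[[str], None],
    ask: Callable[[str, str], str],
    stop: Callable[[str], None],
) -> Dict[str, str]:
    """Start every container once, prompt them one by one, then stop them all."""
    started: List[str] = []
    responses: Dict[str, str] = {}
    try:
        # Pay the startup cost for all containers up front.
        for name in containers:
            start(name)
            started.append(name)
        # Prompt sequentially; no container startup happens in this phase.
        for name in started:
            responses[name] = ask(name, prompt)
    finally:
        # Stop whatever was started, even if a start or a prompt failed.
        for name in reversed(started):
            stop(name)
    return responses
```

Tracking the `started` list means a failure halfway through startup still shuts down the containers that are already running.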

Pros

Reduced latency: No container startup delay during prompting.
Moderate resource control: Only one model is active at a time, but all are ready to respond.

Cons

Idle resource usage: Containers consume some memory and CPU even when not actively used. This overhead should be basically negligible, but since Windows uses a lot of my RAM by default, I sometimes experienced lag. It should not be a problem on clusters.

Performance Snapshots:

3. Parallel Execution (Using Threads or Async Routines)

Description

For each model prompt:
- Start a container in its own thread or async task
- Prompt the model and retrieve the response
- Shut down the container when finished
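The thread-based variant of these steps might look like the sketch below. It uses `concurrent.futures.ThreadPoolExecutor` from the standard library; `start`, `ask`, and `stop` are hypothetical callables supplied by the caller (for example, wrappers around `docker start`, a request to the model, and `docker stop`), and the `max_workers` default of 2 is an arbitrary assumption, not a measured recommendation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_parallel(
    containers: List[str],
    prompt: str,
    start: Callable[[str], None],
    ask: Callable[[str, str], str],
    stop: Callable[[str], None],
    max_workers: int = 2,
) -> Dict[str, str]:
    """Run the start, prompt, stop cycle for several containers concurrently."""

    def worker(name: str) -> str:
        start(name)
        try:
            return ask(name, prompt)
        finally:
            # Shut the container down as soon as its response is in.
            stop(name)

    # max_workers caps how many containers are alive at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, whatever the finish order.
        answers = list(pool.map(worker, containers))
    return dict(zip(containers, answers))
```

Threads are sufficient here because each worker mostly waits on the container and the model rather than running Python code; lowering `max_workers` is one way to limit the load on RAM or GPU.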

Pros

Takes advantage of multiple CPU cores for concurrent execution.

Cons

High system load: Running several models at once may exceed available resources (especially GPU or RAM).
Inefficient: LLMs are computationally intensive. Frequently starting and stopping them adds a lot of unnecessary overhead, and no computation is done during the switch.

Performance Snapshot:

Recommendations

Sequential with Pre-Started Containers is generally the most efficient approach for typical development environments, balancing performance and resource use.
Sequential Start–Use–Stop is recommended when hardware constraints prevent keeping multiple containers loaded in memory.
Parallel Execution is not ideal for consumer-grade machines due to its heavy resource demands. It may be appropriate in high-performance computing (HPC) environments, though in such cases, it is often more effective to use scheduling systems like Slurm to manage workloads.