The Cold Start Time To First Token (CS-TTFT) of InferX snapshot-based container - inferx-net/inferx GitHub Wiki
The InferX snapshot-based cold start follows two design principles:
- Minimized CS-TTFT (Cold Start Time To First Token): the time between the first request reaching the gateway and the first token being returned. In a production environment, the design goal is to keep CS-TTFT under 5 seconds.
- Zero GPU usage before GPU Cold Start: before serving a user request, the model container does not hold ANY GPU resources, including GPU memory and GPU context. This is a lossy version of CPU Cold Start, in which the serving container does not start at all before the cold start. GPU Cold Start trades CPU memory for a lower CS-TTFT.
The CS-TTFT of a snapshot-based container consists of the following steps:
1. Container startup: normal container startup work, such as container file system preparation and container network registration.
2. Snapshot metadata loading: loading snapshot metadata, such as how many GPUs are needed and the GPU memory size.
3. Data loading: initializing the GPU and loading GPU data, both pinned and pageable.
4. First request TTFT: processing the first request. The TTFT of the first request after a cold start will be higher than that of subsequent requests.
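The CS-TTFT is simply the end-to-end sum of these step latencies. A minimal sketch, using hypothetical placeholder durations (not measured InferX numbers):

```python
# Hypothetical per-step durations in seconds, for illustration only.
COLD_START_STEPS = [
    ("container_startup", 1.0),   # Step #1: filesystem prep, network registration
    ("metadata_loading", 0.5),    # Step #2: GPU count, GPU memory size, ...
    ("data_loading", 2.0),        # Step #3: GPU init, pinned and pageable data
    ("first_request_ttft", 1.0),  # Step #4: the first request's TTFT
]

def cs_ttft(steps):
    """CS-TTFT is the end-to-end sum of the step latencies."""
    return sum(duration for _, duration in steps)

print(cs_ttft(COLD_START_STEPS))  # 4.5
```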
As Steps #1 and #2 do not take any GPU resources, InferX "pre-warms" the container before a user request arrives, i.e., it starts the container, finishes Steps #1 and #2, and blocks before Step #3. In our tests, Steps #1 and #2 take around 1-3 seconds, so this optimization removes those 1-3 seconds from the CS-TTFT. A Standby Container takes about 200-400 MB of CPU memory.
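The pre-warm behavior can be sketched as a container thread that runs Steps #1 and #2 eagerly, then blocks until the scheduler triggers the GPU Cold Start. Everything here (the function, the event-based signaling) is a hypothetical illustration of the control flow, not InferX's actual implementation:

```python
import threading

def serve_container(cold_start: threading.Event, log: list):
    """Hypothetical container lifecycle: Steps #1-#2 run at pre-warm time;
    the container then blocks (Standby) before Step #3 until the scheduler
    signals a GPU Cold Start."""
    log.append("step1: container startup")         # no GPU resources used yet
    log.append("step2: snapshot metadata loaded")  # still no GPU resources
    cold_start.wait()                              # Standby: blocked before Step #3
    log.append("step3: GPU data loaded")
    log.append("step4: first request served")

log, cold_start = [], threading.Event()
t = threading.Thread(target=serve_container, args=(cold_start, log))
t.start()          # pre-warm: Steps #1-#2 complete, container sits in Standby
# ... later, a request reaches the gateway and the scheduler triggers the cold start:
cold_start.set()
t.join()
print(log)
```

Because Steps #1 and #2 already ran while the container sat in Standby, only Steps #3 and #4 remain on the request's critical path.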
The model serving container has the following states:
- Start --> Standby: triggered when the scheduler decides to pre-warm a new Standby Container. It starts the container and finishes Steps #1 and #2.
- Standby --> Running: when a new request reaches the Gateway and there is no Running or Idle Container, the scheduler performs a Cold Start on a Standby Container. The Cold Start includes Steps #3 and #4.
- Running --> Idle: when all requests for the Running Container have been served, the container enters the Idle state.
- Idle --> Running: when a new request arrives and there is an Idle Container for the model, the scheduler moves the Idle Container back to the Running state.
- Idle --> End: when a new request arrives and the scheduler finds no free GPU for a Cold Start, it picks an Idle Container and evicts it to free GPU resources. During eviction, the Idle Container is killed like a normal container, freeing all of its CPU and GPU resources.
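The transitions above form a small state machine. A minimal sketch, where the state names follow the wiki but the `Container` class and its API are hypothetical:

```python
# Legal state transitions for a model serving container, per the list above.
TRANSITIONS = {
    ("Start", "Standby"),    # scheduler pre-warms: Steps #1-#2
    ("Standby", "Running"),  # Cold Start on first request: Steps #3-#4
    ("Running", "Idle"),     # all in-flight requests served
    ("Idle", "Running"),     # new request reuses the warm container
    ("Idle", "End"),         # eviction: container killed, CPU and GPU freed
}

class Container:
    """Hypothetical wrapper that enforces the legal transitions."""
    def __init__(self):
        self.state = "Start"

    def transition(self, new_state: str):
        if (self.state, new_state) not in TRANSITIONS:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

c = Container()
for s in ("Standby", "Running", "Idle", "Running", "Idle", "End"):
    c.transition(s)
print(c.state)  # End
```

Note that there is no direct Start --> Running edge: a container must pass through Standby, which is exactly what lets the pre-warm optimization take Steps #1 and #2 off the request path.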