KV cache - AshokBhat/ml GitHub Wiki

About

  • Optimization technique for LLMs that speeds up text generation.
  • Instead of recalculating attention matrices for all previous tokens from scratch with every generated word, the model stores and reuses past Key (K) and Value (V) vectors.

Why it Matters

  • Autoregressive models generate text one token at a time. To predict the next token, the model computes attention scores between the current token and every preceding token. Without caching, each step would recompute the Keys and Values for the entire sequence, so the attention work per step grows quadratically with sequence length. By keeping previous Keys and Values in memory, the attention cost per newly generated token becomes linear in sequence length, which substantially reduces latency.
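The difference in scaling can be made concrete with a rough count of query-key dot products per attention head per layer. This is an illustrative sketch only: the function names are hypothetical, and it assumes that the uncached path re-runs causal attention over the whole sequence at every step.

```python
def attention_scores_without_cache(seq_len: int) -> int:
    """Dot products needed when the whole sequence is re-run with a causal mask."""
    # Token i attends to tokens 0..i, so one full pass needs 1 + 2 + ... + seq_len scores.
    return seq_len * (seq_len + 1) // 2


def attention_scores_with_cache(seq_len: int) -> int:
    """Dot products needed when only the newest query meets the cached keys."""
    return seq_len


for n in (16, 256, 4096):
    print(n, attention_scores_without_cache(n), attention_scores_with_cache(n))
```

The first count grows with the square of the sequence length, the second grows linearly, which is the gap the cache closes at each decode step.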

How it Works

  • LLM inference is divided into two distinct phases:

Prefill Phase

The model processes the entire input prompt at once. It computes Key, Value, and Query states for every token and stores the resulting K and V tensors in the cache.
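A minimal single-head sketch of prefill, followed by one decode step that reuses the cache. It is not taken from any serving engine: the helper names (`matvec`, `attend`), the toy dimension, and the random weights are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import math
import random


def matvec(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention for one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(dim)]


random.seed(0)
d = 4  # toy embedding size (assumption)


def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]


w_q, w_k, w_v = rand_matrix(), rand_matrix(), rand_matrix()
prompt = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]  # 5 toy token embeddings

# Prefill: project every prompt token once and store K and V in the cache.
k_cache = [matvec(x, w_k) for x in prompt]
v_cache = [matvec(x, w_v) for x in prompt]
# Causal attention: token i only sees cache entries 0..i.
prefill_out = [
    attend(matvec(x, w_q), k_cache[: i + 1], v_cache[: i + 1])
    for i, x in enumerate(prompt)
]

# Decode step: project only the new token, append to the cache, then attend.
new_token = [random.gauss(0, 1) for _ in range(d)]
k_cache.append(matvec(new_token, w_k))
v_cache.append(matvec(new_token, w_v))
cached_out = attend(matvec(new_token, w_q), k_cache, v_cache)

# Reference: recompute K and V for the whole sequence from scratch.
full = prompt + [new_token]
ref_k = [matvec(x, w_k) for x in full]
ref_v = [matvec(x, w_v) for x in full]
reference_out = attend(matvec(new_token, w_q), ref_k, ref_v)

print(max(abs(a - b) for a, b in zip(cached_out, reference_out)))
```

The cached result matches the from-scratch result because the K and V of earlier tokens do not depend on later tokens, so storing them loses nothing.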

Decode Phase

When generating new tokens one by one, the model only computes the K and V for the newly generated token and appends them to the existing cache. It then uses the Query (Q) vector of the new token to compute attention against all previously cached K/V pairs.

The Trade-off: Memory vs. Speed

While the speed improvements of KV caching are vital for real-time applications, the memory footprint scales linearly with context length.

  • High VRAM Usage: The KV cache takes up significant space in high-bandwidth memory.
  • Advanced Management: To handle longer context windows, modern serving engines (such as vLLM and TensorRT-LLM) use sophisticated management techniques like continuous batching, paged attention (similar to OS memory paging), and hierarchical offloading to system RAM or NVMe SSDs.
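The linear growth in memory can be estimated with simple arithmetic. The sketch below assumes a standard layout with one K and one V tensor per layer. The function name and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes for a dense (non-paged) cache."""
    # The factor of 2 accounts for storing both a Key and a Value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
total = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token / 2**10, "KiB per token")    # 512.0 KiB per token
print(total / 2**30, "GiB at 4096 tokens")   # 2.0 GiB at 4096 tokens
```

Under these assumptions, doubling the context length or the batch size doubles the cache size, which is why long contexts and large batches put pressure on accelerator memory.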

See also