Prompt caching

Usage

To enable prompt caching, set the environment variable PROMPT_CACHING_ENABLED to true.

$ cat env
SLACK_APP_TOKEN=xapp-1-...
SLACK_BOT_TOKEN=xoxb-...
ANTHROPIC_API_KEY=sk-ant-...
LITELLM_MODEL=claude-3-7-sonnet-20250219
PROMPT_CACHING_ENABLED=true

$ docker run -it --env-file ./env ghcr.io/iwamot/collmbo:latest-slim

When enabled, cache breakpoints are automatically added to the two most recent user messages once the total context size reaches 1,024 tokens. This may help reduce API costs.
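For reference, a cache breakpoint in LiteLLM's message format is a cache_control entry on a content block, roughly like this (a sketch of the request shape; Collmbo adds these breakpoints for you, so you do not need to construct them yourself):

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "...long conversation context...",
                # marks this block as a cache breakpoint
                "cache_control": {"type": "ephemeral"},
            }
        ],
    }
]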

Currently, this feature works only with models that LiteLLM supports for prompt caching, such as Anthropic Claude and Amazon Bedrock models (e.g., Claude, Amazon Nova).
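If you want to check support programmatically, recent LiteLLM versions ship a helper for this (a sketch, assuming your LiteLLM version includes supports_prompt_caching):

from litellm.utils import supports_prompt_caching

# returns True if LiteLLM knows the model supports prompt caching
print(supports_prompt_caching(model="anthropic/claude-3-7-sonnet-20250219"))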

Checking cache usage

If you want to check whether prompt caching was used, look at the cache_read_input_tokens field in the model response.

You can use examples/callback_handler.py as a reference to log the model response via the log_success_event() callback. For example:

from litellm.integrations.custom_logger import CustomLogger

class CallbackHandler(CustomLogger):  # class name is illustrative
    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # for streaming calls, the fully assembled response is in kwargs
        if "complete_streaming_response" in kwargs:
            print(
                f"complete streaming response: {kwargs['complete_streaming_response']}"
            )

A sample output might look like this:

complete streaming response: ModelResponse(usage=Usage(..., cache_read_input_tokens=8516))
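A cache_read_input_tokens value greater than zero means part of the prompt was served from the cache. If you prefer to log just that field inside log_success_event(), a minimal sketch (assuming the streaming response carries a usage object, as in the output above):

response = kwargs.get("complete_streaming_response")
usage = getattr(response, "usage", None)
if usage is not None:
    # 0 means nothing was read from the cache for this request
    print(f"cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")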