Principle: BerriAI LiteLLM Cache Lookup
| Knowledge Sources | Cache-aside pattern; read-through caching; LLM response lifecycle management; type coercion patterns |
|---|---|
| Domains | Caching, LLM Infrastructure, Asynchronous Programming |
| Last Updated | 2026-02-15 |
Overview
Cache lookup and response handling is the process of checking a cache for a previously stored LLM response, converting it back to the correct response type, and storing new responses after successful API calls.
Description
The cache lookup and response workflow sits at the heart of the LLM response caching pipeline. It addresses several interrelated problems:
- Pre-call cache check: Before making an expensive API call to an LLM provider, the system checks whether a matching cached response exists. This requires computing the cache key, querying the appropriate backend (potentially asynchronously), and determining whether the result is a hit or miss.
- Dynamic cache control: Per-request directives such as `no-cache` (skip reading from cache, but still store the response) and `no-store` (do not persist the response after the call) allow callers to override caching behavior without changing global configuration.
- Call-type-aware deserialization: Cached responses are stored as serialized dictionaries or JSON strings. Upon retrieval, they must be converted back into the appropriate response model (`ModelResponse`, `TextCompletionResponse`, `EmbeddingResponse`, `TranscriptionResponse`, `RerankResponse`, `ResponsesAPIResponse`) or wrapped in a streaming iterator if the original request specified `stream=True`.
- Partial embedding cache hits: For embedding requests with multiple inputs, some inputs may have cached results while others do not. The lookup handler must identify which inputs are cached, return partial results, and rewrite the request to fetch only the uncached inputs from the provider.
- Post-call cache storage: After a successful API response, the result is serialized and written to the cache asynchronously (using `asyncio.create_task`) so that the response is returned to the caller without waiting for the cache write to complete.
- Observability integration: Cache hits and misses are reported through the logging and callback system, including timing metrics for the cache lookup itself, so that operators can monitor cache effectiveness.
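As a sketch of the dynamic cache control described above, a small helper could translate the per-request `no-cache`/`no-store` directives into read/write decisions. The helper name and the shape of the `cache` parameter are illustrative assumptions, not LiteLLM's actual API:

```python
from typing import Any, Dict


def cache_directives(kwargs: Dict[str, Any]) -> Dict[str, bool]:
    """Hypothetical helper: map per-request directives to cache behavior.

    `no-cache`: skip the pre-call lookup, but still store the response.
    `no-store`: serve from cache if possible, but do not persist the result.
    """
    cache_params = kwargs.get("cache") or {}
    return {
        "read": not cache_params.get("no-cache", False),
        "write": not cache_params.get("no-store", False),
    }
```

Note how the two directives are independent: `no-cache` still allows the fresh response to be written, while `no-store` still allows a cached response to be served.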
Usage
Use cache lookup and response handling when:
- You are implementing or extending the request lifecycle of an LLM proxy and need to intercept requests before they reach the provider.
- You need to understand or debug why a particular request is or is not returning a cached response.
- You want to implement partial caching for batch embedding requests.
- You are integrating caching with observability systems and need to understand what metadata is attached to cache-hit responses.
Theoretical Basis
Cache lookup follows the cache-aside (lazy-loading) pattern: the application code explicitly checks the cache before calling the origin (the LLM provider), and explicitly writes to the cache after receiving a response. This contrasts with read-through or write-through patterns where the cache itself mediates access to the origin.
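The cache-aside pattern can be shown in a minimal, self-contained sketch. The class and method names are invented for illustration; the key derivation here (a hash over the sorted request payload) stands in for whatever key scheme the real system uses:

```python
import hashlib
import json
from typing import Any, Callable, Dict


class CacheAside:
    """Minimal cache-aside sketch: the application explicitly checks the
    cache, calls the origin on a miss, then writes the result back."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def key(self, request: Dict[str, Any]) -> str:
        # Deterministic cache key derived from the request payload.
        payload = json.dumps(request, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def get_or_call(
        self,
        request: Dict[str, Any],
        origin: Callable[[Dict[str, Any]], Any],
    ) -> Any:
        k = self.key(request)
        hit = self._store.get(k)
        if hit is not None:
            return hit                # cache HIT: skip the origin entirely
        result = origin(request)      # cache MISS: call the origin (LLM provider)
        self._store[k] = result       # explicit write-back by the application
        return result
```

The defining trait, visible here, is that the cache never talks to the origin itself; the application code mediates both the read and the write.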
Pseudocode:
CLASS LLMCachingHandler:
    CONSTRUCTOR(original_function, request_kwargs, start_time):
        self.original_function = original_function
        self.request_kwargs = request_kwargs
        self.start_time = start_time
        -- If Redis backend, set up a DualCache (Redis + in-memory) for faster lookups
        IF cache backend IS Redis:
            self.dual_cache = DualCache(redis_cache, in_memory_cache)

    ASYNC FUNCTION get_cache(model, original_function, logging_obj, call_type, kwargs):
        -- Guard: check that caching is enabled and no-cache is not set
        IF caching is disabled OR no-cache directive is set:
            RETURN None
        start_timer()
        IF call_type is supported by cache:
            cached_result = AWAIT retrieve_from_cache(call_type, kwargs)
        stop_timer()
        IF cached_result is a single result (non-list):
            -- Cache HIT for completion/text_completion/transcription/rerank
            convert cached_result to appropriate response model
            IF request is NOT streaming:
                log cache hit on callbacks
            attach cache_key to response._hidden_params
            RETURN CachingHandlerResponse(cached_result)
        ELSE IF call_type is embedding AND cached_result is a list:
            -- Partial cache HIT for embeddings
            separate cached and uncached inputs
            build partial EmbeddingResponse for cached inputs
            rewrite kwargs["input"] to contain only uncached inputs
            IF all inputs were cached:
                log cache hit
                RETURN CachingHandlerResponse(all_cached=True, embedding_response)
            ELSE:
                RETURN CachingHandlerResponse(partial_response=embedding_response)
        RETURN CachingHandlerResponse(cached_result=None) -- cache MISS

    ASYNC FUNCTION set_cache(result, original_function, kwargs):
        -- Guard: check that caching is enabled and no-store is not set
        IF cache is None OR no-store directive is set:
            RETURN
        IF result is a known response type (ModelResponse, EmbeddingResponse, etc.):
            IF result is EmbeddingResponse AND backend supports bulk write:
                asyncio.create_task(cache.async_add_cache_pipeline(result, **kwargs))
            ELSE:
                asyncio.create_task(cache.async_add_cache(result.to_json(), **kwargs))
        ELSE:
            asyncio.create_task(cache.async_add_cache(result, **kwargs))

    FUNCTION should_store_result_in_cache(original_function, kwargs) -> bool:
        RETURN cache is initialized
               AND call type is in supported_call_types
               AND no-store directive is NOT set
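The partial embedding branch in the pseudocode above can be sketched in Python. `split_embedding_inputs` is a hypothetical helper, not LiteLLM's actual function; it assumes the per-input lookup returns a list aligned with the inputs, with `None` marking a miss:

```python
from typing import Any, Dict, List, Optional, Tuple


def split_embedding_inputs(
    inputs: List[str],
    cached: List[Optional[Dict[str, Any]]],
) -> Tuple[List[Dict[str, Any]], List[str], List[int]]:
    """Separate cached and uncached embedding inputs (illustrative sketch).

    Returns the cached rows, the inputs that still need a provider call,
    and the original positions of those inputs so the provider's partial
    response can be merged back into the full response in order.
    """
    hits: List[Dict[str, Any]] = []
    remaining: List[str] = []
    remaining_idx: List[int] = []
    for i, (text, entry) in enumerate(zip(inputs, cached)):
        if entry is not None:
            hits.append(entry)          # cache HIT for this input
        else:
            remaining.append(text)      # must be fetched from the provider
            remaining_idx.append(i)
    return hits, remaining, remaining_idx
```

Rewriting `kwargs["input"]` to `remaining` then sends only the uncached inputs upstream; the index list is what makes the later merge order-preserving.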
The key design properties are:
- Non-blocking writes: Cache writes are dispatched as fire-and-forget async tasks. A failed write does not affect the response to the caller.
- Type fidelity: Cached responses are faithfully reconstructed into their original Pydantic model types, including hidden parameters like `cache_hit=True`.
- Streaming support: When the original request specified streaming, the cached dictionary response is converted into a `CustomStreamWrapper` that yields chunks, preserving the streaming contract.
- Dual-cache optimisation: For Redis backends, a DualCache layer checks an in-memory cache first, falling back to Redis only on a local miss. This reduces network round trips for frequently accessed keys.
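A minimal sketch of the non-blocking write property, with an in-memory dict standing in for the cache backend (both function names are illustrative assumptions, not LiteLLM's API):

```python
import asyncio
from typing import Any, Dict


async def async_add_cache(store: Dict[str, Any], key: str, value: Any) -> None:
    """Stand-in for a backend write (a real backend would do network I/O)."""
    store[key] = value


async def respond(store: Dict[str, Any], key: str, result: Any) -> Any:
    # Fire-and-forget: schedule the cache write and return the result
    # immediately, without awaiting the write.
    task = asyncio.create_task(async_add_cache(store, key, result))
    # Retrieve any exception so a failed write is swallowed rather than
    # surfacing as an "exception was never retrieved" warning.
    task.add_done_callback(lambda t: t.exception())
    return result
```

The trade-off is weaker durability: the caller may receive a response whose cache write later fails silently, which is acceptable for a cache but would not be for a primary store.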