Principle: BerriAI LiteLLM Cache Lookup
| Knowledge Sources | Cache-aside pattern; read-through caching; LLM response lifecycle management; type coercion patterns |
|---|---|
| Domains | Caching, LLM Infrastructure, Asynchronous Programming |
| Last Updated | 2026-02-15 |
Overview
Cache lookup and response handling is the process of checking a cache for a previously stored LLM response, converting it back to the correct response type, and storing new responses after successful API calls.
Description
The cache lookup and response workflow sits at the heart of the LLM response caching pipeline. It addresses several interrelated problems:
- Pre-call cache check: Before making an expensive API call to an LLM provider, the system checks whether a matching cached response exists. This requires computing the cache key, querying the appropriate backend (potentially asynchronously), and determining whether the result is a hit or miss.
- Dynamic cache control: Per-request directives such as `no-cache` (skip reading from cache, but still store the response) and `no-store` (do not persist the response after the call) allow callers to override caching behavior without changing global configuration.
- Call-type-aware deserialization: Cached responses are stored as serialized dictionaries or JSON strings. Upon retrieval, they must be converted back into the appropriate response model (`ModelResponse`, `TextCompletionResponse`, `EmbeddingResponse`, `TranscriptionResponse`, `RerankResponse`, `ResponsesAPIResponse`) or wrapped in a streaming iterator if the original request specified `stream=True`.
- Partial embedding cache hits: For embedding requests with multiple inputs, some inputs may have cached results while others do not. The lookup handler must identify which inputs are cached, return partial results, and rewrite the request to fetch only the uncached inputs from the provider.
- Post-call cache storage: After a successful API response, the result is serialized and written to the cache asynchronously (using `asyncio.create_task`) so that the response is returned to the caller without waiting for the cache write to complete.
- Observability integration: Cache hits and misses are reported through the logging and callback system, including timing metrics for the cache lookup itself, so that operators can monitor cache effectiveness.
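As a sketch of the dynamic cache control described above, a small helper could translate the per-request `no-cache`/`no-store` directives into read/write decisions. The helper name and the shape of the `cache` parameter are illustrative assumptions, not LiteLLM's actual API:

```python
from typing import Any, Dict


def cache_directives(kwargs: Dict[str, Any]) -> Dict[str, bool]:
    """Hypothetical helper: map per-request directives to cache behavior.

    `no-cache`: skip the pre-call lookup, but still store the response.
    `no-store`: serve from cache if possible, but do not persist the result.
    """
    cache_params = kwargs.get("cache") or {}
    return {
        "read": not cache_params.get("no-cache", False),
        "write": not cache_params.get("no-store", False),
    }
```

Note how the two directives are independent: `no-cache` still allows the fresh response to be written, while `no-store` still allows a cached response to be served.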
Usage
Use cache lookup and response handling when:
- You are implementing or extending the request lifecycle of an LLM proxy and need to intercept requests before they reach the provider.
- You need to understand or debug why a particular request is or is not returning a cached response.
- You want to implement partial caching for batch embedding requests.
- You are integrating caching with observability systems and need to understand what metadata is attached to cache-hit responses.
Theoretical Basis
Cache lookup follows the cache-aside (lazy-loading) pattern: the application code explicitly checks the cache before calling the origin (the LLM provider), and explicitly writes to the cache after receiving a response. This contrasts with read-through or write-through patterns where the cache itself mediates access to the origin.
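The cache-aside pattern can be shown in a minimal, self-contained sketch. The class and method names are invented for illustration; the key derivation here (a hash over the sorted request payload) stands in for whatever key scheme the real system uses:

```python
import hashlib
import json
from typing import Any, Callable, Dict


class CacheAside:
    """Minimal cache-aside sketch: the application explicitly checks the
    cache, calls the origin on a miss, then writes the result back."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def key(self, request: Dict[str, Any]) -> str:
        # Deterministic cache key derived from the request payload.
        payload = json.dumps(request, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def get_or_call(
        self,
        request: Dict[str, Any],
        origin: Callable[[Dict[str, Any]], Any],
    ) -> Any:
        k = self.key(request)
        hit = self._store.get(k)
        if hit is not None:
            return hit                # cache HIT: skip the origin entirely
        result = origin(request)      # cache MISS: call the origin (LLM provider)
        self._store[k] = result       # explicit write-back by the application
        return result
```

The defining trait, visible here, is that the cache never talks to the origin itself; the application code mediates both the read and the write.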
Pseudocode:
CLASS LLMCachingHandler:
    CONSTRUCTOR(original_function, request_kwargs, start_time):
        self.original_function = original_function
        self.request_kwargs = request_kwargs
        self.start_time = start_time
        -- If Redis backend, set up a DualCache (Redis + in-memory) for faster lookups
        IF cache backend IS Redis:
            self.dual_cache = DualCache(redis_cache, in_memory_cache)

    ASYNC FUNCTION get_cache(model, original_function, logging_obj, call_type, kwargs):
        -- Guard: check that caching is enabled and no-cache is not set
        IF caching is disabled OR no-cache directive is set:
            RETURN None
        start_timer()
        IF call_type is supported by cache:
            cached_result = AWAIT retrieve_from_cache(call_type, kwargs)
        stop_timer()
        IF cached_result is a single result (non-list):
            -- Cache HIT for completion/text_completion/transcription/rerank
            convert cached_result to appropriate response model
            IF request is NOT streaming:
                log cache hit on callbacks
            attach cache_key to response._hidden_params
            RETURN CachingHandlerResponse(cached_result)
        ELSE IF call_type is embedding AND cached_result is a list:
            -- Partial cache HIT for embeddings
            separate cached and uncached inputs
            build partial EmbeddingResponse for cached inputs
            rewrite kwargs["input"] to contain only uncached inputs
            IF all inputs were cached:
                log cache hit
                RETURN CachingHandlerResponse(all_cached=True, embedding_response)
            ELSE:
                RETURN CachingHandlerResponse(partial_response=embedding_response)
        RETURN CachingHandlerResponse(cached_result=None) -- cache MISS

    ASYNC FUNCTION set_cache(result, original_function, kwargs):
        -- Guard: check that caching is enabled and no-store is not set
        IF cache is None OR no-store directive is set:
            RETURN
        IF result is a known response type (ModelResponse, EmbeddingResponse, etc.):
            IF result is EmbeddingResponse AND backend supports bulk write:
                asyncio.create_task(cache.async_add_cache_pipeline(result, **kwargs))
            ELSE:
                asyncio.create_task(cache.async_add_cache(result.to_json(), **kwargs))
        ELSE:
            asyncio.create_task(cache.async_add_cache(result, **kwargs))

    FUNCTION should_store_result_in_cache(original_function, kwargs) -> bool:
        RETURN cache is initialized
               AND call type is in supported_call_types
               AND no-store directive is NOT set
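The partial embedding branch in the pseudocode above can be sketched in Python. `split_embedding_inputs` is a hypothetical helper, not LiteLLM's actual function; it assumes the per-input lookup returns a list aligned with the inputs, with `None` marking a miss:

```python
from typing import Any, Dict, List, Optional, Tuple


def split_embedding_inputs(
    inputs: List[str],
    cached: List[Optional[Dict[str, Any]]],
) -> Tuple[List[Dict[str, Any]], List[str], List[int]]:
    """Separate cached and uncached embedding inputs (illustrative sketch).

    Returns the cached rows, the inputs that still need a provider call,
    and the original positions of those inputs so the provider's partial
    response can be merged back into the full response in order.
    """
    hits: List[Dict[str, Any]] = []
    remaining: List[str] = []
    remaining_idx: List[int] = []
    for i, (text, entry) in enumerate(zip(inputs, cached)):
        if entry is not None:
            hits.append(entry)          # cache HIT for this input
        else:
            remaining.append(text)      # must be fetched from the provider
            remaining_idx.append(i)
    return hits, remaining, remaining_idx
```

Rewriting `kwargs["input"]` to `remaining` then sends only the uncached inputs upstream; the index list is what makes the later merge order-preserving.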
The key design properties are:
- Non-blocking writes: Cache writes are dispatched as fire-and-forget async tasks. A failed write does not affect the response to the caller.
- Type fidelity: Cached responses are faithfully reconstructed into their original Pydantic model types, including hidden parameters like `cache_hit=True`.
- Streaming support: When the original request specified streaming, the cached dictionary response is converted into a `CustomStreamWrapper` that yields chunks, preserving the streaming contract.
- Dual-cache optimisation: For Redis backends, a DualCache layer checks an in-memory cache first, falling back to Redis only on a local miss. This reduces network round trips for frequently accessed keys.
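A minimal sketch of the non-blocking write property, with an in-memory dict standing in for the cache backend (both function names are illustrative assumptions, not LiteLLM's API):

```python
import asyncio
from typing import Any, Dict


async def async_add_cache(store: Dict[str, Any], key: str, value: Any) -> None:
    """Stand-in for a backend write (a real backend would do network I/O)."""
    store[key] = value


async def respond(store: Dict[str, Any], key: str, result: Any) -> Any:
    # Fire-and-forget: schedule the cache write and return the result
    # immediately, without awaiting the write.
    task = asyncio.create_task(async_add_cache(store, key, result))
    # Retrieve any exception so a failed write is swallowed rather than
    # surfacing as an "exception was never retrieved" warning.
    task.add_done_callback(lambda t: t.exception())
    return result
```

The trade-off is weaker durability: the caller may receive a response whose cache write later fails silently, which is acceptable for a cache but would not be for a primary store.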