Implementation: BerriAI LiteLLM Caching Handler
| Knowledge Sources | https://github.com/BerriAI/litellm |
|---|---|
| Domains | Caching, LLM Infrastructure, Asynchronous Programming |
| Last Updated | 2026-02-15 |
Overview
The LLMCachingHandler class is the concrete handler for looking up cached LLM responses and storing newly generated responses.
Description
The LLMCachingHandler class in litellm/caching/caching_handler.py is the wrapper that sits between the LLM API call lifecycle and the underlying cache backends. It exposes two primary methods:
_async_get_cache: Checks whether caching is enabled and the no-cache directive is absent, then calls _retrieve_from_cache to query the backend. For non-embedding call types, a cache hit triggers type conversion (via _convert_cached_result_to_model_response), callback logging, and returns a CachingHandlerResponse. For embedding call types, it supports partial cache hits: each input in the embedding list is checked individually (in parallel using asyncio.gather), and the results are split into cached and uncached partitions.
async_set_cache: After a successful LLM API response, this method checks the no-store directive and call type eligibility, then asynchronously writes the result to the cache. EmbeddingResponse objects use the bulk pipeline write (async_add_cache_pipeline) for efficiency, while other response types are serialized via model_dump_json() and written individually. All writes are dispatched as non-blocking asyncio.create_task calls.
The class also provides synchronous counterparts (_sync_get_cache, sync_set_cache) for non-async code paths. A supporting CachingHandlerResponse Pydantic model is used to return both the cached result and partial embedding state.
Usage
The LLMCachingHandler is instantiated internally by the LiteLLM core for each LLM API call, with the original function, the request kwargs, and the call start time; it is consulted before the provider call (cache lookup) and after it (cache write). End users do not typically import it directly, but it is central to understanding the caching lifecycle.
Code Reference
| Source Location | litellm/caching/caching_handler.py, lines 89-902 |
|---|---|
| Signature (_async_get_cache) | LLMCachingHandler._async_get_cache(self, model: str, original_function: Callable, logging_obj: LiteLLMLoggingObj, start_time: datetime.datetime, call_type: str, kwargs: Dict[str, Any], args: Optional[Tuple[Any, ...]] = None) -> Optional[CachingHandlerResponse] |
| Signature (async_set_cache) | LLMCachingHandler.async_set_cache(self, result: Any, original_function: Callable, kwargs: Dict[str, Any], args: Optional[Tuple[Any, ...]] = None) -> None |
| Import | from litellm.caching.caching_handler import LLMCachingHandler |
Constructor:
```python
class LLMCachingHandler:
    def __init__(
        self,
        original_function: Callable,
        request_kwargs: Dict[str, Any],
        start_time: datetime.datetime,
    ):
        self.async_streaming_chunks: List[ModelResponse] = []
        self.sync_streaming_chunks: List[ModelResponse] = []
        self.request_kwargs = request_kwargs
        self.original_function = original_function
        self.start_time = start_time
        if litellm.cache is not None and isinstance(litellm.cache.cache, RedisCache):
            self.dual_cache: Optional[DualCache] = DualCache(
                redis_cache=litellm.cache.cache,
                in_memory_cache=in_memory_cache_obj,
            )
        else:
            self.dual_cache = None
```
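When a RedisCache backend is configured, the constructor builds a DualCache that layers an in-memory cache in front of Redis. The read-through pattern behind that arrangement can be sketched like this (a minimal stand-in using dicts, not litellm's DualCache class):

```python
from typing import Any, Dict, Optional

class DualCacheSketch:
    """Minimal read-through pair: a fast in-memory layer in front of a
    slower store (Redis in the real handler; a plain dict here)."""

    def __init__(self) -> None:
        self.in_memory: Dict[str, Any] = {}
        self.redis: Dict[str, Any] = {}  # stand-in for the Redis layer

    def get(self, key: str) -> Optional[Any]:
        if key in self.in_memory:        # fast path: no network round-trip
            return self.in_memory[key]
        value = self.redis.get(key)      # slow path
        if value is not None:
            self.in_memory[key] = value  # backfill the fast layer
        return value

    def set(self, key: str, value: Any) -> None:
        # Write to both layers so the next read is a fast-path hit
        self.in_memory[key] = value
        self.redis[key] = value
```

The design choice is latency: repeated reads of the same key skip the Redis round-trip entirely after the first lookup backfills the in-memory layer.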
CachingHandlerResponse:
```python
class CachingHandlerResponse(BaseModel):
    cached_result: Optional[Any] = None
    final_embedding_cached_response: Optional[EmbeddingResponse] = None
    embedding_all_elements_cache_hit: bool = False
```
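The three fields encode four possible outcomes (full hit, full embedding hit, partial embedding hit, miss), plus None for caching disabled. A simplified interpreter, using a dataclass stand-in instead of the pydantic model to keep the sketch dependency-free, shows how a caller branches:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Dataclass stand-in for the pydantic CachingHandlerResponse model
@dataclass
class CachingHandlerResponse:
    cached_result: Optional[Any] = None
    final_embedding_cached_response: Optional[Any] = None
    embedding_all_elements_cache_hit: bool = False

def interpret(resp: Optional[CachingHandlerResponse]) -> str:
    """Map a lookup result to the outcome the caller must handle."""
    if resp is None:
        return "caching disabled"
    if resp.cached_result is not None:
        return "full hit"                       # return cached_result as-is
    if resp.embedding_all_elements_cache_hit:
        return "all embedding inputs cached"    # return the final response
    if resp.final_embedding_cached_response is not None:
        return "partial embedding hit"          # call API for uncached inputs
    return "miss"                               # call the provider normally
```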
I/O Contract
Inputs for _async_get_cache:
| Parameter | Type | Description |
|---|---|---|
| model | str | The model identifier for the LLM request |
| original_function | Callable | The original LLM API function (e.g., acompletion, aembedding) |
| logging_obj | LiteLLMLoggingObj | The logging/callback manager for this request |
| start_time | datetime.datetime | The timestamp when the request started |
| call_type | str | The type of call (e.g., "acompletion", "aembedding") |
| kwargs | Dict[str, Any] | The full set of keyword arguments for the LLM call |
| args | Optional[Tuple[Any, ...]] | Optional positional arguments |
Outputs for _async_get_cache:
| Return Type | Description |
|---|---|
| Optional[CachingHandlerResponse] | Returns None if caching is disabled. Returns a CachingHandlerResponse with cached_result on hit (or None on miss), final_embedding_cached_response for partial embedding hits, and the embedding_all_elements_cache_hit flag. |
Inputs for async_set_cache:
| Parameter | Type | Description |
|---|---|---|
| result | Any | The LLM API response (ModelResponse, EmbeddingResponse, TranscriptionResponse, RerankResponse, ResponsesAPIResponse, or raw result) |
| original_function | Callable | The original LLM API function |
| kwargs | Dict[str, Any] | The full set of keyword arguments for the LLM call |
| args | Optional[Tuple[Any, ...]] | Optional positional arguments |
Outputs for async_set_cache:
| Return Type | Description |
|---|---|
| None | The method dispatches cache writes as non-blocking asyncio.create_task fire-and-forget operations and returns immediately. |
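Fire-and-forget dispatch means exceptions raised inside the write task are easy to lose silently. A common pattern for this style of dispatch (a general sketch, not litellm's code) attaches a done-callback that surfaces failures:

```python
import asyncio
import logging
from typing import Any, List, Tuple

logger = logging.getLogger("cache")
results: List[Tuple[str, Any]] = []

async def write_to_cache(key: str, value: Any) -> None:
    # Stand-in for the backend write (e.g., a Redis SET)
    results.append((key, value))

def _log_task_failure(task: asyncio.Task) -> None:
    # Surface exceptions from fire-and-forget tasks instead of dropping them
    if not task.cancelled() and task.exception() is not None:
        logger.error("cache write failed: %s", task.exception())

async def set_cache_nonblocking(key: str, value: Any) -> None:
    task = asyncio.create_task(write_to_cache(key, value))
    task.add_done_callback(_log_task_failure)
    # Returns immediately; the write completes in the background

async def main() -> None:
    await set_cache_nonblocking("k1", "v1")
    # Yield once so the background task runs before the loop closes
    await asyncio.sleep(0)
```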
Usage Examples
Typical lifecycle within the LiteLLM core (simplified):
```python
import datetime

import litellm
from litellm.caching.caching_handler import LLMCachingHandler

# kwargs and logging_obj come from the surrounding call context
start_time = datetime.datetime.now()

# 1. Instantiate the handler at the start of an LLM call
caching_handler = LLMCachingHandler(
    original_function=litellm.acompletion,
    request_kwargs=kwargs,
    start_time=start_time,
)

# 2. Check the cache before calling the provider
cache_response = await caching_handler._async_get_cache(
    model="gpt-4",
    original_function=litellm.acompletion,
    logging_obj=logging_obj,
    start_time=start_time,
    call_type="acompletion",
    kwargs=kwargs,
)
if cache_response is not None and cache_response.cached_result is not None:
    # Cache HIT -- return the cached result directly
    return cache_response.cached_result

# 3. Cache MISS -- call the LLM provider
result = await litellm.acompletion(**kwargs)

# 4. Store the result in the cache for future requests
await caching_handler.async_set_cache(
    result=result,
    original_function=litellm.acompletion,
    kwargs=kwargs,
)
return result
```
Handling partial embedding cache hits:
```python
# For embedding requests, the handler may return a partial result
kwargs = {"input": ["hello", "world", "test"], "model": "text-embedding-ada-002"}
cache_response = await caching_handler._async_get_cache(
    model="text-embedding-ada-002",
    original_function=litellm.aembedding,
    logging_obj=logging_obj,
    start_time=start_time,
    call_type="aembedding",
    kwargs=kwargs,
)
if cache_response is not None and cache_response.embedding_all_elements_cache_hit:
    # All inputs were cached
    return cache_response.final_embedding_cached_response
elif cache_response is not None and cache_response.final_embedding_cached_response is not None:
    # Partial hit: kwargs["input"] has been rewritten to uncached inputs only
    api_result = await litellm.aembedding(**kwargs)
    # Combine cached + API results
    combined = caching_handler._combine_cached_embedding_response_with_api_result(
        cache_response, api_result, start_time, datetime.datetime.now()
    )
    return combined
```
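The parallel per-input lookup that produces these partial hits can be sketched like this (simplified; a pre-populated dict stands in for the cache backend, and partition_inputs is a hypothetical helper):

```python
import asyncio
from typing import Dict, List, Optional, Tuple

# Pre-populated stand-in cache: two of the three inputs already cached
_embedding_cache: Dict[str, List[float]] = {
    "hello": [0.1, 0.2],
    "world": [0.3, 0.4],
}

async def _lookup(text: str) -> Optional[List[float]]:
    # Stand-in for a single async backend lookup
    return _embedding_cache.get(text)

async def partition_inputs(
    inputs: List[str],
) -> Tuple[Dict[str, List[float]], List[str]]:
    # Check every input concurrently, as the handler does via asyncio.gather
    hits = await asyncio.gather(*(_lookup(t) for t in inputs))
    cached = {t: v for t, v in zip(inputs, hits) if v is not None}
    uncached = [t for t, v in zip(inputs, hits) if v is None]
    return cached, uncached
```

Only the uncached partition is forwarded to the provider; the cached vectors are merged back into the final response afterward.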
Checking no-store directive:
```python
# The handler respects per-request cache control directives
result = await litellm.acompletion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={"no-store": True},  # Response will NOT be cached
)
```