Implementation: BerriAI LiteLLM Caching Handler
| Knowledge Sources | https://github.com/BerriAI/litellm |
|---|---|
| Domains | Caching, LLM Infrastructure, Asynchronous Programming |
| Last Updated | 2026-02-15 |
Overview
The LLMCachingHandler class is the concrete handler for looking up cached LLM responses and storing newly generated responses.
Description
The LLMCachingHandler class in litellm/caching/caching_handler.py is the wrapper that sits between the LLM API call lifecycle and the underlying cache backends. It exposes two primary methods:
_async_get_cache: Checks whether caching is enabled and the no-cache directive is absent, then calls _retrieve_from_cache to query the backend. For non-embedding call types, a cache hit triggers type conversion (via _convert_cached_result_to_model_response), callback logging, and returns a CachingHandlerResponse. For embedding call types, it supports partial cache hits: each input in the embedding list is checked individually (in parallel using asyncio.gather), and the results are split into cached and uncached partitions.
async_set_cache: After a successful LLM API response, this method checks the no-store directive and call type eligibility, then asynchronously writes the result to the cache. EmbeddingResponse objects use the bulk pipeline write (async_add_cache_pipeline) for efficiency, while other response types are serialized via model_dump_json() and written individually. All writes are dispatched as non-blocking asyncio.create_task calls.
The class also provides synchronous counterparts (_sync_get_cache, sync_set_cache) for non-async code paths. A supporting CachingHandlerResponse Pydantic model is used to return both the cached result and partial embedding state.
Usage
The LLMCachingHandler is instantiated internally by the LiteLLM core for each LLM API call, with the original function, the request kwargs, and the call start time; it is consulted before the provider call (cache lookup) and after it (cache write). End users do not typically import it directly, but it is central to understanding the caching lifecycle.
Code Reference
| Source Location | litellm/caching/caching_handler.py, lines 89-902 |
|---|---|
| Signature (_async_get_cache) | LLMCachingHandler._async_get_cache(self, model: str, original_function: Callable, logging_obj: LiteLLMLoggingObj, start_time: datetime.datetime, call_type: str, kwargs: Dict[str, Any], args: Optional[Tuple[Any, ...]] = None) -> Optional[CachingHandlerResponse] |
| Signature (async_set_cache) | LLMCachingHandler.async_set_cache(self, result: Any, original_function: Callable, kwargs: Dict[str, Any], args: Optional[Tuple[Any, ...]] = None) -> None |
| Import | from litellm.caching.caching_handler import LLMCachingHandler |
Constructor:
```python
class LLMCachingHandler:
    def __init__(
        self,
        original_function: Callable,
        request_kwargs: Dict[str, Any],
        start_time: datetime.datetime,
    ):
        self.async_streaming_chunks: List[ModelResponse] = []
        self.sync_streaming_chunks: List[ModelResponse] = []
        self.request_kwargs = request_kwargs
        self.original_function = original_function
        self.start_time = start_time
        if litellm.cache is not None and isinstance(litellm.cache.cache, RedisCache):
            self.dual_cache: Optional[DualCache] = DualCache(
                redis_cache=litellm.cache.cache,
                in_memory_cache=in_memory_cache_obj,
            )
        else:
            self.dual_cache = None
```
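When a RedisCache backend is configured, the constructor builds a DualCache that layers an in-memory cache in front of Redis. The read-through pattern behind that arrangement can be sketched like this (a minimal stand-in using dicts, not litellm's DualCache class):

```python
from typing import Any, Dict, Optional

class DualCacheSketch:
    """Minimal read-through pair: a fast in-memory layer in front of a
    slower store (Redis in the real handler; a plain dict here)."""

    def __init__(self) -> None:
        self.in_memory: Dict[str, Any] = {}
        self.redis: Dict[str, Any] = {}  # stand-in for the Redis layer

    def get(self, key: str) -> Optional[Any]:
        if key in self.in_memory:        # fast path: no network round-trip
            return self.in_memory[key]
        value = self.redis.get(key)      # slow path
        if value is not None:
            self.in_memory[key] = value  # backfill the fast layer
        return value

    def set(self, key: str, value: Any) -> None:
        # Write to both layers so the next read is a fast-path hit
        self.in_memory[key] = value
        self.redis[key] = value
```

The design choice is latency: repeated reads of the same key skip the Redis round-trip entirely after the first lookup backfills the in-memory layer.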
CachingHandlerResponse:
```python
class CachingHandlerResponse(BaseModel):
    cached_result: Optional[Any] = None
    final_embedding_cached_response: Optional[EmbeddingResponse] = None
    embedding_all_elements_cache_hit: bool = False
```
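The three fields encode four possible outcomes (full hit, full embedding hit, partial embedding hit, miss), plus None for caching disabled. A simplified interpreter, using a dataclass stand-in instead of the pydantic model to keep the sketch dependency-free, shows how a caller branches:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Dataclass stand-in for the pydantic CachingHandlerResponse model
@dataclass
class CachingHandlerResponse:
    cached_result: Optional[Any] = None
    final_embedding_cached_response: Optional[Any] = None
    embedding_all_elements_cache_hit: bool = False

def interpret(resp: Optional[CachingHandlerResponse]) -> str:
    """Map a lookup result to the outcome the caller must handle."""
    if resp is None:
        return "caching disabled"
    if resp.cached_result is not None:
        return "full hit"                       # return cached_result as-is
    if resp.embedding_all_elements_cache_hit:
        return "all embedding inputs cached"    # return the final response
    if resp.final_embedding_cached_response is not None:
        return "partial embedding hit"          # call API for uncached inputs
    return "miss"                               # call the provider normally
```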
I/O Contract
Inputs for _async_get_cache:
| Parameter | Type | Description |
|---|---|---|
| model | str | The model identifier for the LLM request |
| original_function | Callable | The original LLM API function (e.g., acompletion, aembedding) |
| logging_obj | LiteLLMLoggingObj | The logging/callback manager for this request |
| start_time | datetime.datetime | The timestamp when the request started |
| call_type | str | The type of call (e.g., "acompletion", "aembedding") |
| kwargs | Dict[str, Any] | The full set of keyword arguments for the LLM call |
| args | Optional[Tuple[Any, ...]] | Optional positional arguments |
Outputs for _async_get_cache:
| Return Type | Description |
|---|---|
| Optional[CachingHandlerResponse] | Returns None if caching is disabled. Returns a CachingHandlerResponse with cached_result on hit (or None on miss), final_embedding_cached_response for partial embedding hits, and the embedding_all_elements_cache_hit flag. |
Inputs for async_set_cache:
| Parameter | Type | Description |
|---|---|---|
| result | Any | The LLM API response (ModelResponse, EmbeddingResponse, TranscriptionResponse, RerankResponse, ResponsesAPIResponse, or raw result) |
| original_function | Callable | The original LLM API function |
| kwargs | Dict[str, Any] | The full set of keyword arguments for the LLM call |
| args | Optional[Tuple[Any, ...]] | Optional positional arguments |
Outputs for async_set_cache:
| Return Type | Description |
|---|---|
| None | The method dispatches cache writes as non-blocking asyncio.create_task fire-and-forget operations and returns immediately. |
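Fire-and-forget dispatch means exceptions raised inside the write task are easy to lose silently. A common pattern for this style of dispatch (a general sketch, not litellm's code) attaches a done-callback that surfaces failures:

```python
import asyncio
import logging
from typing import Any, List, Tuple

logger = logging.getLogger("cache")
results: List[Tuple[str, Any]] = []

async def write_to_cache(key: str, value: Any) -> None:
    # Stand-in for the backend write (e.g., a Redis SET)
    results.append((key, value))

def _log_task_failure(task: asyncio.Task) -> None:
    # Surface exceptions from fire-and-forget tasks instead of dropping them
    if not task.cancelled() and task.exception() is not None:
        logger.error("cache write failed: %s", task.exception())

async def set_cache_nonblocking(key: str, value: Any) -> None:
    task = asyncio.create_task(write_to_cache(key, value))
    task.add_done_callback(_log_task_failure)
    # Returns immediately; the write completes in the background

async def main() -> None:
    await set_cache_nonblocking("k1", "v1")
    # Yield once so the background task runs before the loop closes
    await asyncio.sleep(0)
```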
Usage Examples
Typical lifecycle within the LiteLLM core (simplified):
```python
import datetime

import litellm
from litellm.caching.caching_handler import LLMCachingHandler

# kwargs and logging_obj come from the surrounding call context
start_time = datetime.datetime.now()

# 1. Instantiate the handler at the start of an LLM call
caching_handler = LLMCachingHandler(
    original_function=litellm.acompletion,
    request_kwargs=kwargs,
    start_time=start_time,
)

# 2. Check the cache before calling the provider
cache_response = await caching_handler._async_get_cache(
    model="gpt-4",
    original_function=litellm.acompletion,
    logging_obj=logging_obj,
    start_time=start_time,
    call_type="acompletion",
    kwargs=kwargs,
)
if cache_response is not None and cache_response.cached_result is not None:
    # Cache HIT -- return the cached result directly
    return cache_response.cached_result

# 3. Cache MISS -- call the LLM provider
result = await litellm.acompletion(**kwargs)

# 4. Store the result in the cache for future requests
await caching_handler.async_set_cache(
    result=result,
    original_function=litellm.acompletion,
    kwargs=kwargs,
)
return result
```
Handling partial embedding cache hits:
```python
# For embedding requests, the handler may return a partial result
kwargs = {"input": ["hello", "world", "test"], "model": "text-embedding-ada-002"}
cache_response = await caching_handler._async_get_cache(
    model="text-embedding-ada-002",
    original_function=litellm.aembedding,
    logging_obj=logging_obj,
    start_time=start_time,
    call_type="aembedding",
    kwargs=kwargs,
)
if cache_response is not None and cache_response.embedding_all_elements_cache_hit:
    # All inputs were cached
    return cache_response.final_embedding_cached_response
elif cache_response is not None and cache_response.final_embedding_cached_response is not None:
    # Partial hit: kwargs["input"] has been rewritten to uncached inputs only
    api_result = await litellm.aembedding(**kwargs)
    # Combine cached + API results
    combined = caching_handler._combine_cached_embedding_response_with_api_result(
        cache_response, api_result, start_time, datetime.datetime.now()
    )
    return combined
```
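The parallel per-input lookup that produces these partial hits can be sketched like this (simplified; a pre-populated dict stands in for the cache backend, and partition_inputs is a hypothetical helper):

```python
import asyncio
from typing import Dict, List, Optional, Tuple

# Pre-populated stand-in cache: two of the three inputs already cached
_embedding_cache: Dict[str, List[float]] = {
    "hello": [0.1, 0.2],
    "world": [0.3, 0.4],
}

async def _lookup(text: str) -> Optional[List[float]]:
    # Stand-in for a single async backend lookup
    return _embedding_cache.get(text)

async def partition_inputs(
    inputs: List[str],
) -> Tuple[Dict[str, List[float]], List[str]]:
    # Check every input concurrently, as the handler does via asyncio.gather
    hits = await asyncio.gather(*(_lookup(t) for t in inputs))
    cached = {t: v for t, v in zip(inputs, hits) if v is not None}
    uncached = [t for t, v in zip(inputs, hits) if v is None]
    return cached, uncached
```

Only the uncached partition is forwarded to the provider; the cached vectors are merged back into the final response afterward.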
Checking no-store directive:
```python
# The handler respects per-request cache control directives
result = await litellm.acompletion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={"no-store": True},  # Response will NOT be cached
)
```