Implementation:BerriAI Litellm Caching Handler

From Leeroopedia
Knowledge Sources https://github.com/BerriAI/litellm
Domains Caching, LLM Infrastructure, Asynchronous Programming
Last Updated 2026-02-15

Overview

Concrete handler, provided by the LLMCachingHandler class, for looking up cached LLM responses and for storing new responses in the cache.

Description

The LLMCachingHandler class in litellm/caching/caching_handler.py is the wrapper that sits between the LLM API call lifecycle and the underlying cache backends. It exposes two primary methods:

  • _async_get_cache: Checks whether caching is enabled and the no-cache directive is absent, then calls _retrieve_from_cache to query the backend. For non-embedding call types, a cache hit triggers type conversion (via _convert_cached_result_to_model_response) and callback logging, and the method returns a CachingHandlerResponse. For embedding call types, it supports partial cache hits: each input in the embedding list is checked individually (in parallel using asyncio.gather), and the results are split into cached and uncached partitions (see the sketch below).
  • async_set_cache: After a successful LLM API response, this method checks the no-store directive and call type eligibility, then asynchronously writes the result to the cache. EmbeddingResponse objects use the bulk pipeline write (async_add_cache_pipeline) for efficiency, while other response types are serialized via model_dump_json() and written individually. All writes are dispatched as non-blocking asyncio.create_task calls.

The class also provides synchronous counterparts (_sync_get_cache, sync_set_cache) for non-async code paths. A supporting CachingHandlerResponse Pydantic model is used to return both the cached result and partial embedding state.
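
The per-input partitioning for embedding calls can be pictured with a simplified, self-contained sketch. The dict-backed cache and the get_cached coroutine below are hypothetical stand-ins rather than LiteLLM's internal code; the sketch only illustrates the gather-and-split pattern described above:

import asyncio
from typing import Any, Dict, List, Optional, Tuple

async def get_cached(text: str, cache: Dict[str, Any]) -> Optional[Any]:
    # Hypothetical stand-in for an async cache backend lookup.
    return cache.get(text)

async def partition_embedding_inputs(
    inputs: List[str], cache: Dict[str, Any]
) -> Tuple[List[Tuple[int, Any]], List[Tuple[int, str]]]:
    # Check every embedding input against the cache in parallel, then split
    # the batch into (index, cached_result) hits and (index, input) misses.
    lookups = await asyncio.gather(*(get_cached(text, cache) for text in inputs))
    hits = [(i, r) for i, r in enumerate(lookups) if r is not None]
    misses = [(i, inputs[i]) for i, r in enumerate(lookups) if r is None]
    return hits, misses

# Only "hello" is cached, so only "world" would be sent to the provider.
hits, misses = asyncio.run(
    partition_embedding_inputs(["hello", "world"], {"hello": [0.1, 0.2]})
)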

Usage

The LLMCachingHandler is instantiated internally by the LiteLLM core for each LLM API call, with the original function, the request kwargs, and the call's start time. It is not typically imported directly by end users, but it is central to understanding the caching lifecycle: it is consulted before the provider call (cache lookup) and after it (cache write).

Code Reference

Source Location litellm/caching/caching_handler.py, lines 89-902
Signature (_async_get_cache) LLMCachingHandler._async_get_cache(self, model: str, original_function: Callable, logging_obj: LiteLLMLoggingObj, start_time: datetime.datetime, call_type: str, kwargs: Dict[str, Any], args: Optional[Tuple[Any, ...]] = None) -> Optional[CachingHandlerResponse]
Signature (async_set_cache) LLMCachingHandler.async_set_cache(self, result: Any, original_function: Callable, kwargs: Dict[str, Any], args: Optional[Tuple[Any, ...]] = None) -> None
Import from litellm.caching.caching_handler import LLMCachingHandler

Constructor:

class LLMCachingHandler:
    def __init__(
        self,
        original_function: Callable,
        request_kwargs: Dict[str, Any],
        start_time: datetime.datetime,
    ):
        self.async_streaming_chunks: List[ModelResponse] = []
        self.sync_streaming_chunks: List[ModelResponse] = []
        self.request_kwargs = request_kwargs
        self.original_function = original_function
        self.start_time = start_time
        if litellm.cache is not None and isinstance(litellm.cache.cache, RedisCache):
            self.dual_cache: Optional[DualCache] = DualCache(
                redis_cache=litellm.cache.cache,
                in_memory_cache=in_memory_cache_obj,
            )
        else:
            self.dual_cache = None

CachingHandlerResponse:

class CachingHandlerResponse(BaseModel):
    cached_result: Optional[Any] = None
    final_embedding_cached_response: Optional[EmbeddingResponse] = None
    embedding_all_elements_cache_hit: bool = False

I/O Contract

Inputs for _async_get_cache:

Parameter | Type | Description
model | str | The model identifier for the LLM request
original_function | Callable | The original LLM API function (e.g., acompletion, aembedding)
logging_obj | LiteLLMLoggingObj | The logging/callback manager for this request
start_time | datetime.datetime | The timestamp when the request started
call_type | str | The type of call (e.g., "acompletion", "aembedding")
kwargs | Dict[str, Any] | The full set of keyword arguments for the LLM call
args | Optional[Tuple[Any, ...]] | Optional positional arguments

Outputs for _async_get_cache:

Return Type | Description
Optional[CachingHandlerResponse] | Returns None if caching is disabled. Returns a CachingHandlerResponse with cached_result on hit (or None on miss), final_embedding_cached_response for partial embedding hits, and the embedding_all_elements_cache_hit flag.

Inputs for async_set_cache:

Parameter | Type | Description
result | Any | The LLM API response (ModelResponse, EmbeddingResponse, TranscriptionResponse, RerankResponse, ResponsesAPIResponse, or raw result)
original_function | Callable | The original LLM API function
kwargs | Dict[str, Any] | The full set of keyword arguments for the LLM call
args | Optional[Tuple[Any, ...]] | Optional positional arguments

Outputs for async_set_cache:

Return Type | Description
None | The method dispatches cache writes as non-blocking asyncio.create_task fire-and-forget operations and returns immediately.
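
Because the write path returns before the cache write completes, callers never block on storage. Below is a minimal, self-contained sketch of that fire-and-forget pattern; the dict-backed cache and store_response coroutine are placeholders, not LiteLLM's cache API:

import asyncio
from typing import Any, Dict

async def store_response(cache: Dict[str, Any], key: str, value: Any) -> None:
    # Placeholder for an async cache write (e.g. a Redis SET).
    cache[key] = value

async def handle_call(cache: Dict[str, Any]) -> str:
    result = "llm response"  # pretend this came back from the provider
    # Dispatch the write without awaiting it, mirroring the
    # asyncio.create_task pattern described in the contract above.
    asyncio.create_task(store_response(cache, "request-key", result))
    return result  # the caller gets the result immediately

async def main() -> None:
    cache: Dict[str, Any] = {}
    print(await handle_call(cache))  # "llm response", returned right away
    await asyncio.sleep(0)           # let the background write run before exit
    print(cache)                     # {'request-key': 'llm response'}

asyncio.run(main())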

Usage Examples

Typical lifecycle within the LiteLLM core (simplified):

import datetime

import litellm
from litellm.caching.caching_handler import LLMCachingHandler

# kwargs and logging_obj are provided by the surrounding LiteLLM call context

# 1. Instantiate the handler at the start of an LLM call
start_time = datetime.datetime.now()
caching_handler = LLMCachingHandler(
    original_function=litellm.acompletion,
    request_kwargs=kwargs,
    start_time=start_time,
)

# 2. Check cache before calling the provider
cache_response = await caching_handler._async_get_cache(
    model="gpt-4",
    original_function=litellm.acompletion,
    logging_obj=logging_obj,
    start_time=start_time,
    call_type="acompletion",
    kwargs=kwargs,
)

if cache_response is not None and cache_response.cached_result is not None:
    # Cache HIT -- return the cached result directly
    return cache_response.cached_result

# 3. Cache MISS -- call the LLM provider
result = await litellm.acompletion(**kwargs)

# 4. Store the result in the cache for future requests
await caching_handler.async_set_cache(
    result=result,
    original_function=litellm.acompletion,
    kwargs=kwargs,
)

return result

Handling partial embedding cache hits:

# For embedding requests, the handler may return a partial result
# (this example assumes caching is enabled, so cache_response is not None)
cache_response = await caching_handler._async_get_cache(
    model="text-embedding-ada-002",
    original_function=litellm.aembedding,
    logging_obj=logging_obj,
    start_time=start_time,
    call_type="aembedding",
    kwargs={"input": ["hello", "world", "test"], "model": "text-embedding-ada-002"},
)

if cache_response.embedding_all_elements_cache_hit:
    # All inputs were cached
    return cache_response.final_embedding_cached_response
elif cache_response.final_embedding_cached_response is not None:
    # Partial hit: kwargs["input"] has been rewritten to uncached inputs only
    api_result = await litellm.aembedding(**kwargs)
    # Combine cached + API results
    combined = caching_handler._combine_cached_embedding_response_with_api_result(
        cache_response, api_result, start_time, datetime.datetime.now()
    )
    return combined

Checking no-store directive:

# The handler respects per-request cache control directives
result = await litellm.acompletion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={"no-store": True},  # Response will NOT be cached
)
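
The read-side directive mentioned in the Description works the same way. Since _async_get_cache checks for no-cache before querying the backend, a request shaped like the following (mirroring the no-store example above) skips the cache lookup entirely:

# The handler also respects the read-side directive
result = await litellm.acompletion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={"no-cache": True},  # Cached responses will NOT be read for this call
)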
