Workflow: BerriAI LiteLLM Response Caching
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Caching, Cost_Optimization |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
End-to-end process for caching LLM API responses to reduce costs, latency, and redundant API calls using multiple backend strategies.
Description
This workflow covers the configuration and use of LiteLLM's caching system to store and retrieve LLM API responses. The system supports multiple cache backends (in-memory, Redis, S3, disk, Azure Blob, Google Cloud Storage) and semantic caching (Redis-based and Qdrant-based) that matches similar but not identical queries. Caching integrates transparently with the completion() API and the Router, requiring minimal code changes to enable significant cost savings.
Key outputs:
- Cached responses returned instantly for repeated or similar queries
- Configurable TTL (time-to-live) for cache entries
- Semantic similarity matching for near-duplicate queries
- Cache hit/miss metadata on responses
Usage
Execute this workflow when your application makes repeated or similar LLM calls and you want to reduce API costs and response latency. This is particularly valuable for applications with predictable query patterns, development/testing environments, or systems where identical prompts are sent frequently.
Execution Steps
Step 1: Cache Backend Selection
Choose the appropriate cache backend based on your deployment requirements. Options include in-memory (single process), Redis (distributed), S3 (persistent/archival), disk (local persistence), and semantic caches (Redis or Qdrant for similarity-based matching). Each backend has different trade-offs for latency, persistence, and scalability.
Key considerations:
- In-memory cache is fastest but limited to single process and bounded by memory
- Redis supports distributed caching across multiple instances with TTL
- S3 and GCS provide durable storage for long-term caching
- Semantic caches use embeddings to match similar queries, not just exact matches
- Dual cache combines in-memory and Redis for two-tier performance
Step 2: Cache Initialization
Create a Cache instance with the chosen backend type and configuration parameters. For Redis, provide the host, port, and optional password. For S3, provide the bucket name and credentials. For semantic caches, configure the embedding model and similarity threshold. Assign the cache to litellm.cache to enable it globally.
Key considerations:
- `litellm.cache = Cache(type="redis", host="...", port=6379)` enables global caching
- The namespace parameter isolates cache entries between environments
- TTL can be set globally or per-request
- Supported params configuration controls which parameters affect cache key generation
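A minimal configuration sketch of the two most common setups; the import path and exact parameter names can vary between LiteLLM versions, and the Redis host value here is an assumption for a local instance:

```python
import litellm
from litellm.caching import Cache

# In-memory cache (single process): simplest option, no extra infrastructure
litellm.cache = Cache()

# Redis cache (distributed): shared across instances, supports TTL
litellm.cache = Cache(
    type="redis",
    host="localhost",   # assumption: local Redis instance
    port=6379,
    password=None,
    ttl=600,            # default time-to-live for entries, in seconds
    namespace="dev",    # assumption: isolates dev entries from prod
)
```

Once `litellm.cache` is assigned, subsequent `completion()` calls consult the cache automatically with no further code changes.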
Step 3: Cache Key Generation
When a completion call is made, the caching system generates a cache key by hashing the model name, messages, and relevant parameters. Only parameters listed in supported_call_params affect the key, ensuring that metadata changes do not invalidate cached responses.
Key considerations:
- Cache keys are SHA-256 hashes of normalized request parameters
- The `messages` content and `model` are always included in the key
- Parameters like `temperature`, `max_tokens`, and `tools` affect the key
- Custom cache keys can be provided via the `cache_key` metadata parameter
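The key-generation idea can be sketched in plain Python. This is an illustration of the hashing scheme described above, not LiteLLM's exact algorithm; the `supported_params` whitelist shown is a hypothetical default:

```python
import hashlib
import json

def make_cache_key(model, messages, params,
                   supported_params=("temperature", "max_tokens", "tools")):
    """Hash-based cache key: only whitelisted parameters contribute,
    so changes to unrelated metadata do not invalidate cached entries."""
    key_material = {
        "model": model,
        "messages": messages,
        # keep only the params that should affect the cache key
        "params": {k: v for k, v in sorted(params.items()) if k in supported_params},
    }
    normalized = json.dumps(key_material, sort_keys=True)  # stable serialization
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

With this scheme, two requests that differ only in non-whitelisted metadata (for example a tracing ID) hash to the same key, while a change to `temperature` produces a different one.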
Step 4: Cache Lookup and Response
Before making an API call, the caching handler checks for a matching cached response. On a cache hit, the cached ModelResponse is returned immediately with a cache hit indicator in the metadata. On a cache miss, the API call proceeds normally and the response is stored in the cache for future use.
Key considerations:
- Cache hits are indicated via response metadata headers
- Cached responses include the original usage statistics
- Cache miss triggers a normal API call followed by cache storage
- Async cache operations are supported for non-blocking performance
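The lookup-then-store flow can be sketched as follows. This is a hypothetical helper, not LiteLLM's internal handler; the simplified tuple key stands in for the hashed key from Step 3:

```python
def cached_completion(cache, call_api, model, messages, **params):
    """Check the cache before calling the API; on a miss, call through
    and store the response for future requests."""
    key = (model, str(messages))  # simplified stand-in for the hashed key
    hit = cache.get(key)
    if hit is not None:
        # cache hit: return immediately with a hit indicator in the metadata
        return {**hit, "cache_hit": True}
    response = call_api(model, messages, **params)  # cache miss: real API call
    cache[key] = response                           # store for next time
    return {**response, "cache_hit": False}
```

Note that the stored response carries its original usage statistics, so a hit returns the same token counts as the call that populated it.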
Step 5: Cache Management
Manage the cache through flush operations, TTL adjustments, and monitoring. The cache supports flushing all entries or specific keys, updating TTL for existing entries, and monitoring hit rates. In proxy deployments, cache management endpoints provide API access to these operations.
Key considerations:
- The `/cache/flush` endpoint clears all cached entries in proxy mode
- TTL can be overridden per-request via the `ttl` parameter
- Cache size is bounded for in-memory backends to prevent memory exhaustion
- Monitoring cache hit rates helps optimize TTL and caching strategy
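The TTL and flush mechanics above can be illustrated with a minimal in-memory sketch (not LiteLLM's implementation): each entry records its own expiry, a per-entry TTL overrides the cache-wide default, and `flush()` plays the role of the proxy's `/cache/flush` endpoint:

```python
import time

class TTLCache:
    """Minimal TTL-bounded cache for illustration purposes."""
    def __init__(self, default_ttl=600):
        self.default_ttl = default_ttl
        self._store = {}  # key -> (expiry_timestamp, value)

    def set(self, key, value, ttl=None):
        # per-entry TTL overrides the cache-wide default
        expiry = time.monotonic() + (ttl if ttl is not None else self.default_ttl)
        self._store[key] = (expiry, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() > expiry:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

    def flush(self):
        self._store.clear()  # analogous to the proxy /cache/flush endpoint
```

Tracking how often `get()` returns `None` versus a value gives the hit rate used to tune TTLs, as noted above.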