
Workflow:BerriAI LiteLLM Response Caching

From Leeroopedia
Domains LLM_Ops, Caching, Cost_Optimization
Last Updated 2026-02-15 16:00 GMT

Overview

End-to-end process for caching LLM API responses to reduce costs, latency, and redundant API calls using multiple backend strategies.

Description

This workflow covers the configuration and use of LiteLLM's caching system to store and retrieve LLM API responses. The system supports multiple cache backends (in-memory, Redis, S3, disk, Azure Blob, Google Cloud Storage) and semantic caching (Redis-based and Qdrant-based) that matches similar but not identical queries. Caching integrates transparently with the completion() API and the Router, requiring minimal code changes to enable significant cost savings.

Key outputs:

  • Cached responses returned without a new API call for repeated or similar queries
  • Configurable TTL (time-to-live) for cache entries
  • Semantic similarity matching for near-duplicate queries
  • Cache hit/miss metadata on responses

Usage

Execute this workflow when your application makes repeated or similar LLM calls and you want to reduce API costs and response latency. This is particularly valuable for applications with predictable query patterns, development/testing environments, or systems where identical prompts are sent frequently.

Execution Steps

Step 1: Cache Backend Selection

Choose the appropriate cache backend based on your deployment requirements. Options include in-memory (single process), Redis (distributed), S3 (persistent/archival), disk (local persistence), and semantic caches (Redis or Qdrant for similarity-based matching). Each backend has different trade-offs for latency, persistence, and scalability.

Key considerations:

  • In-memory cache is fastest but limited to single process and bounded by memory
  • Redis supports distributed caching across multiple instances with TTL
  • S3 and GCS provide durable storage for long-term caching
  • Semantic caches use embeddings to match similar queries, not just exact matches
  • Dual cache combines in-memory and Redis for two-tier performance
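
The two-tier behavior in the last bullet can be sketched in plain Python, with a local dict standing in for the in-memory layer and a second shared mapping standing in for Redis. All names here are illustrative, not LiteLLM's internal API:

```python
class DualCache:
    """Two-tier cache: a fast in-process dict backed by a slower shared store.

    The local layer answers repeat lookups within one process; the shared
    layer (a plain dict here, a Redis client in a real deployment) serves
    other instances of the application.
    """

    def __init__(self, shared_store):
        self.local = {}             # tier 1: in-process, fastest
        self.shared = shared_store  # tier 2: shared across instances

    def get(self, key):
        if key in self.local:            # tier-1 hit: no network round trip
            return self.local[key]
        value = self.shared.get(key)     # tier-2 lookup
        if value is not None:
            self.local[key] = value      # promote into tier 1 for next time
        return value

    def set(self, key, value):
        self.local[key] = value
        self.shared[key] = value         # write-through so peers see it too
```

A second process that shares only the tier-2 store still gets hits, which is the point of combining the two backends.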

Step 2: Cache Initialization

Create a Cache instance with the chosen backend type and configuration parameters. For Redis, provide the host, port, and optional password. For S3, provide the bucket name and credentials. For semantic caches, configure the embedding model and similarity threshold. Assign the cache to litellm.cache to enable it globally.

Key considerations:

  • litellm.cache = Cache(type="redis", host="...", port=6379) enables global caching
  • Namespace parameter isolates cache entries between environments
  • TTL can be set globally or per-request
  • Supported params configuration controls which parameters affect cache key generation
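
Putting the bullets above together, a global-enable setup follows the pattern from the first bullet. The host value is a placeholder, and the namespace and TTL parameter names are assumptions based on the options described above, so treat this as a configuration sketch rather than a definitive recipe:

```python
import litellm
from litellm import Cache

# Global enable: once litellm.cache is assigned, completion() calls
# check Redis before hitting the provider API.
litellm.cache = Cache(
    type="redis",
    host="localhost",     # placeholder: point at your Redis instance
    port=6379,
    namespace="staging",  # isolates entries between environments
    ttl=600,              # default time-to-live for new entries, in seconds
)
```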

Step 3: Cache Key Generation

When a completion call is made, the caching system generates a cache key by hashing the model name, messages, and relevant parameters. Only parameters listed in supported_call_params affect the key, ensuring that metadata changes do not invalidate cached responses.

Key considerations:

  • Cache keys are SHA-256 hashes of normalized request parameters
  • The messages content and model are always included in the key
  • Parameters like temperature, max_tokens, and tools affect the key
  • Custom cache keys can be provided via the cache_key metadata parameter
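
The key-generation scheme described above can be approximated in a few lines of standard-library Python. The whitelist below is illustrative; LiteLLM's actual supported-params list may differ:

```python
import hashlib
import json

# Illustrative whitelist: only these request fields feed the cache key,
# so changes to fields like request metadata never invalidate cached entries.
KEY_PARAMS = ("model", "messages", "temperature", "max_tokens", "tools")

def cache_key(request):
    """Return a SHA-256 hex digest over the normalized, whitelisted params."""
    relevant = {k: request[k] for k in KEY_PARAMS if k in request}
    # sort_keys gives a canonical serialization, so two requests with the
    # same relevant parameters always hash to the same key
    normalized = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

With this scheme, changing a whitelisted parameter such as temperature produces a new key, while changing excluded fields such as metadata does not.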

Step 4: Cache Lookup and Response

Before making an API call, the caching handler checks for a matching cached response. On a cache hit, the cached ModelResponse is returned immediately with a cache hit indicator in the metadata. On a cache miss, the API call proceeds normally and the response is stored in the cache for future use.

Key considerations:

  • Cache hits are indicated via response metadata headers
  • Cached responses include the original usage statistics
  • Cache miss triggers a normal API call followed by cache storage
  • Async cache operations are supported for non-blocking performance
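
The hit/miss flow above can be modeled with a small wrapper around any completion function. The names here are illustrative, not LiteLLM internals:

```python
def cached_completion(cache, key, call_api):
    """Check the cache before calling; store misses for next time.

    Returns (response, cache_hit) so callers can see the hit indicator,
    analogous to the cache-hit metadata on real responses.
    """
    if key in cache:
        return cache[key], True      # cache hit: no API call is made
    response = call_api()            # cache miss: normal API call proceeds
    cache[key] = response            # store the response for future use
    return response, False
```

Because the stored object is the full response, cached hits carry the original usage statistics, as noted above.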

Step 5: Cache Management

Manage the cache through flush operations, TTL adjustments, and monitoring. The cache supports flushing all entries or specific keys, updating TTL for existing entries, and monitoring hit rates. In proxy deployments, cache management endpoints provide API access to these operations.

Key considerations:

  • /cache/flush endpoint clears all cached entries in proxy mode
  • TTL can be overridden per-request via the ttl parameter
  • Cache size is bounded for in-memory backends to prevent memory exhaustion
  • Monitoring cache hit rates helps optimize TTL and caching strategy
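
The TTL and flush behavior described above can be sketched with a minimal expiring cache. The class and method names are illustrative; the clock is injectable so expiry can be tested without sleeping:

```python
import time

class TTLCache:
    """Minimal cache illustrating TTL expiry, per-key TTL override, and flush."""

    def __init__(self, default_ttl, clock=time.monotonic):
        self.default_ttl = default_ttl
        self.clock = clock              # injectable for deterministic tests
        self._store = {}                # key -> (value, expires_at)

    def set(self, key, value, ttl=None):
        # A per-entry TTL overrides the default, like a per-request `ttl` param.
        expires_at = self.clock() + (ttl if ttl is not None else self.default_ttl)
        self._store[key] = (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:   # expired: evict and report a miss
            del self._store[key]
            return None
        return value

    def flush(self):
        """Clear all entries, analogous to the /cache/flush proxy endpoint."""
        self._store.clear()
```

Tracking how often `get` returns None versus a value gives the hit rate that the last bullet suggests monitoring when tuning TTLs.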

Execution Diagram

GitHub URL

Workflow Repository