Principle: BerriAI LiteLLM Cache Backend Selection
| Knowledge Sources | Software architecture best practices; distributed systems caching patterns; LLM API gateway design |
|---|---|
| Domains | Caching, Distributed Systems, LLM Infrastructure |
| Last Updated | 2026-02-15 |
Overview
Cache backend selection is the design decision of choosing the most appropriate storage engine for caching responses based on deployment topology, latency requirements, and feature needs.
Description
When building systems that cache LLM responses, a single caching strategy rarely fits all deployment scenarios. Cache backend selection addresses the problem of matching operational requirements to a storage engine's capabilities. The key factors in backend selection include:
- Deployment scope: A single-process application benefits from in-memory (local) caching with zero network overhead, while a multi-instance deployment requires a shared, network-accessible store such as Redis.
- Durability and persistence: Ephemeral in-memory caches lose data on process restart. Object-store backends (S3, GCS, Azure Blob) offer durable, long-lived caches at the cost of higher latency.
- Semantic similarity: Traditional exact-match caching misses semantically equivalent but textually different requests. Semantic cache backends use vector embeddings and similarity thresholds to match requests that are meaning-equivalent rather than string-identical.
- Cost and operational complexity: Managed cloud storage (S3, GCS) minimises operational burden; self-hosted Redis demands infrastructure management but delivers sub-millisecond lookups.
A well-designed system exposes backend selection through a single, enumerated type so that the rest of the caching pipeline remains backend-agnostic.
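As a minimal sketch of such a single configuration point, assuming an illustrative environment-variable name and backend identifiers (not any specific library's API):

```python
import os

# Hypothetical single configuration point: one environment variable
# (name assumed for illustration) selects the backend, and everything
# downstream stays agnostic to the choice.
SUPPORTED_BACKENDS = {
    "local", "redis", "redis-semantic", "s3", "gcs",
    "azure-blob", "disk", "qdrant-semantic",
}

def backend_from_env(default: str = "local") -> str:
    backend = os.environ.get("CACHE_BACKEND", default)
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported cache backend: {backend!r}")
    return backend
```

Validating against a closed set at the configuration boundary means an unsupported backend fails fast at startup rather than deep inside the caching pipeline.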
Usage
Use cache backend selection when:
- You are deploying an LLM gateway or proxy that serves multiple downstream consumers and need to decide where cached responses are stored.
- You need to switch between development (local, in-memory) and production (Redis, S3) configurations without changing application logic.
- You want to enable semantic caching to improve hit rates for paraphrased prompts.
- You need to comply with data-residency requirements that dictate where cached data may reside (e.g., a specific cloud region or on-disk only).
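One way to satisfy the development-versus-production switching requirement is a per-environment configuration map; the environment names, host, and port below are illustrative assumptions:

```python
# Hypothetical per-environment cache configurations: development uses
# an in-memory cache, production a shared Redis instance. Application
# code reads only the resolved mapping and never branches on backend.
CACHE_CONFIGS = {
    "development": {"type": "local"},
    "production": {"type": "redis", "host": "cache.internal", "port": 6379},
}

def cache_config(environment: str) -> dict:
    try:
        return CACHE_CONFIGS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
```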
Theoretical Basis
Cache backend selection follows the Strategy pattern, where the caching subsystem delegates storage operations to an interchangeable backend object. The client code interacts with a uniform interface while the concrete strategy (local, Redis, S3, etc.) handles the actual storage.
Pseudocode:
```
ENUM CacheBackend:
    LOCAL            -- in-process memory store
    REDIS            -- networked key-value store
    REDIS_SEMANTIC   -- Redis with vector similarity search
    S3               -- object storage (AWS)
    GCS              -- object storage (Google Cloud)
    AZURE_BLOB       -- object storage (Azure)
    DISK             -- local filesystem
    QDRANT_SEMANTIC  -- vector database with similarity search

FUNCTION select_backend(config) -> CacheStore:
    MATCH config.type:
        LOCAL           -> return InMemoryStore()
        REDIS           -> return RedisStore(config.host, config.port, config.password)
        REDIS_SEMANTIC  -> return RedisSemanticStore(config.host, config.embedding_model, config.threshold)
        S3              -> return S3Store(config.bucket, config.region)
        GCS             -> return GCSStore(config.bucket)
        AZURE_BLOB      -> return AzureBlobStore(config.account_url, config.container)
        DISK            -> return DiskStore(config.directory)
        QDRANT_SEMANTIC -> return QdrantStore(config.api_base, config.collection, config.threshold)
        DEFAULT         -> raise UnsupportedBackendError
```
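The pseudocode above can be sketched in Python using structural typing for the uniform interface. Only the LOCAL branch is implemented here; the class and function names are illustrative, not a specific library's API:

```python
from typing import Optional, Protocol

class CacheStore(Protocol):
    """Uniform interface the caching pipeline programs against."""
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...

class InMemoryStore:
    """LOCAL backend: a plain in-process dictionary."""
    def __init__(self) -> None:
        self._data: dict = {}
    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)
    def set(self, key: str, value: str) -> None:
        self._data[key] = value

def select_backend(config: dict) -> CacheStore:
    # Only the LOCAL branch is fleshed out; the remaining branches would
    # construct the corresponding client (redis, boto3, etc.).
    if config["type"] == "local":
        return InMemoryStore()
    raise NotImplementedError(f"backend {config['type']!r} not wired up here")
```

Because `CacheStore` is a `Protocol`, concrete stores need no shared base class; any object with matching `get`/`set` signatures satisfies the interface.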
The key design properties are:
- Single point of configuration: The backend type is specified once; all downstream code is polymorphic over the chosen backend.
- Open/Closed principle: New backends can be added by extending the enum and providing a new concrete store implementation without modifying existing code paths.
- Separation of concerns: Cache key generation, TTL management, and lookup logic are independent of the storage medium.
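The separation-of-concerns property can be illustrated with key generation and TTL logic that never touch the store; the hashing scheme and function names below are assumptions for the sketch:

```python
import hashlib
import json
import time
from typing import Optional

# Sketch of concerns that live outside the store: deterministic key
# generation over the request payload, and a TTL freshness check. Any
# backend that can get/set strings participates unchanged.
def cache_key(model: str, messages: list) -> str:
    payload = json.dumps({"model": model, "messages": messages},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def is_fresh(stored_at: float, ttl_seconds: float,
             now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    return (now - stored_at) < ttl_seconds
```

Serializing with `sort_keys=True` keeps keys stable across dict orderings, so the same logical request always maps to the same cache entry regardless of backend.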
When evaluating backends, the primary trade-offs are:
| Backend Category | Latency | Shared Across Processes | Persistence | Semantic Matching |
|---|---|---|---|---|
| In-Memory | Sub-microsecond | No | No | No |
| Redis | Sub-millisecond | Yes | Optional | No |
| Redis/Qdrant Semantic | Milliseconds | Yes | Optional | Yes |
| Object Store (S3/GCS/Azure) | Tens of milliseconds | Yes | Yes | No |
| Disk | Milliseconds | No | Yes | No |
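The semantic-matching column can be illustrated with a toy cosine-similarity lookup. The hand-written vectors stand in for embedding-model output, and the threshold value is an illustrative default, not a recommendation:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_lookup(query_vec: list, cache: list, threshold: float = 0.9):
    """Return the cached response whose embedding is most similar to the
    query, but only if similarity clears the configured threshold."""
    best = max(cache, key=lambda entry: cosine(query_vec, entry["vec"]),
               default=None)
    if best is not None and cosine(query_vec, best["vec"]) >= threshold:
        return best["response"]
    return None
```

The threshold is the knob that trades hit rate against correctness: lower values match looser paraphrases but risk returning a response cached for a genuinely different question.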