Principle: BerriAI LiteLLM Cache Backend Selection
| Knowledge Sources | Software architecture best practices; distributed systems caching patterns; LLM API gateway design |
|---|---|
| Domains | Caching, Distributed Systems, LLM Infrastructure |
| Last Updated | 2026-02-15 |
Overview
Cache backend selection is the design decision of choosing the most appropriate storage engine for caching responses based on deployment topology, latency requirements, and feature needs.
Description
When building systems that cache LLM responses, a single caching strategy rarely fits all deployment scenarios. Cache backend selection addresses the problem of matching operational requirements to a storage engine's capabilities. The key factors in backend selection include:
- Deployment scope: A single-process application benefits from in-memory (local) caching with zero network overhead, while a multi-instance deployment requires a shared, network-accessible store such as Redis.
- Durability and persistence: Ephemeral in-memory caches lose data on process restart. Object-store backends (S3, GCS, Azure Blob) offer durable, long-lived caches at the cost of higher latency.
- Semantic similarity: Traditional exact-match caching misses semantically equivalent but textually different requests. Semantic cache backends use vector embeddings and similarity thresholds to match requests that are meaning-equivalent rather than string-identical.
- Cost and operational complexity: Managed cloud storage (S3, GCS) minimises operational burden; self-hosted Redis demands infrastructure management but delivers sub-millisecond lookups.
A well-designed system exposes backend selection through a single, enumerated type so that the rest of the caching pipeline remains backend-agnostic.
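As a minimal sketch of such a single configuration point, assuming an illustrative environment-variable name and backend identifiers (not any specific library's API):

```python
import os

# Hypothetical single configuration point: one environment variable
# (name assumed for illustration) selects the backend, and everything
# downstream stays agnostic to the choice.
SUPPORTED_BACKENDS = {
    "local", "redis", "redis-semantic", "s3", "gcs",
    "azure-blob", "disk", "qdrant-semantic",
}

def backend_from_env(default: str = "local") -> str:
    backend = os.environ.get("CACHE_BACKEND", default)
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported cache backend: {backend!r}")
    return backend
```

Validating against a closed set at the configuration boundary means an unsupported backend fails fast at startup rather than deep inside the caching pipeline.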
Usage
Use cache backend selection when:
- You are deploying an LLM gateway or proxy that serves multiple downstream consumers and need to decide where cached responses are stored.
- You need to switch between development (local, in-memory) and production (Redis, S3) configurations without changing application logic.
- You want to enable semantic caching to improve hit rates for paraphrased prompts.
- You need to comply with data-residency requirements that dictate where cached data may reside (e.g., a specific cloud region or on-disk only).
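One way to satisfy the development-versus-production switching requirement is a per-environment configuration map; the environment names, host, and port below are illustrative assumptions:

```python
# Hypothetical per-environment cache configurations: development uses
# an in-memory cache, production a shared Redis instance. Application
# code reads only the resolved mapping and never branches on backend.
CACHE_CONFIGS = {
    "development": {"type": "local"},
    "production": {"type": "redis", "host": "cache.internal", "port": 6379},
}

def cache_config(environment: str) -> dict:
    try:
        return CACHE_CONFIGS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
```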
Theoretical Basis
Cache backend selection follows the Strategy pattern, where the caching subsystem delegates storage operations to an interchangeable backend object. The client code interacts with a uniform interface while the concrete strategy (local, Redis, S3, etc.) handles the actual storage.
Pseudocode:
```
ENUM CacheBackend:
    LOCAL            -- in-process memory store
    REDIS            -- networked key-value store
    REDIS_SEMANTIC   -- Redis with vector similarity search
    S3               -- object storage (AWS)
    GCS              -- object storage (Google Cloud)
    AZURE_BLOB       -- object storage (Azure)
    DISK             -- local filesystem
    QDRANT_SEMANTIC  -- vector database with similarity search

FUNCTION select_backend(config) -> CacheStore:
    MATCH config.type:
        LOCAL           -> return InMemoryStore()
        REDIS           -> return RedisStore(config.host, config.port, config.password)
        REDIS_SEMANTIC  -> return RedisSemanticStore(config.host, config.embedding_model, config.threshold)
        S3              -> return S3Store(config.bucket, config.region)
        GCS             -> return GCSStore(config.bucket)
        AZURE_BLOB      -> return AzureBlobStore(config.account_url, config.container)
        DISK            -> return DiskStore(config.directory)
        QDRANT_SEMANTIC -> return QdrantStore(config.api_base, config.collection, config.threshold)
        DEFAULT         -> raise UnsupportedBackendError
```
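The pseudocode above can be sketched in Python using structural typing for the uniform interface. Only the LOCAL branch is implemented here; the class and function names are illustrative, not a specific library's API:

```python
from typing import Optional, Protocol

class CacheStore(Protocol):
    """Uniform interface the caching pipeline programs against."""
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...

class InMemoryStore:
    """LOCAL backend: a plain in-process dictionary."""
    def __init__(self) -> None:
        self._data: dict = {}
    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)
    def set(self, key: str, value: str) -> None:
        self._data[key] = value

def select_backend(config: dict) -> CacheStore:
    # Only the LOCAL branch is fleshed out; the remaining branches would
    # construct the corresponding client (redis, boto3, etc.).
    if config["type"] == "local":
        return InMemoryStore()
    raise NotImplementedError(f"backend {config['type']!r} not wired up here")
```

Because `CacheStore` is a `Protocol`, concrete stores need no shared base class; any object with matching `get`/`set` signatures satisfies the interface.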
The key design properties are:
- Single point of configuration: The backend type is specified once; all downstream code is polymorphic over the chosen backend.
- Open/Closed principle: New backends can be added by extending the enum and providing a new concrete store implementation without modifying existing code paths.
- Separation of concerns: Cache key generation, TTL management, and lookup logic are independent of the storage medium.
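The separation-of-concerns property can be illustrated with key generation and TTL logic that never touch the store; the hashing scheme and function names below are assumptions for the sketch:

```python
import hashlib
import json
import time
from typing import Optional

# Sketch of concerns that live outside the store: deterministic key
# generation over the request payload, and a TTL freshness check. Any
# backend that can get/set strings participates unchanged.
def cache_key(model: str, messages: list) -> str:
    payload = json.dumps({"model": model, "messages": messages},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def is_fresh(stored_at: float, ttl_seconds: float,
             now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    return (now - stored_at) < ttl_seconds
```

Serializing with `sort_keys=True` keeps keys stable across dict orderings, so the same logical request always maps to the same cache entry regardless of backend.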
When evaluating backends, the primary trade-offs are:
| Backend Category | Latency | Shared Across Processes | Persistence | Semantic Matching |
|---|---|---|---|---|
| In-Memory | Sub-microsecond | No | No | No |
| Redis | Sub-millisecond | Yes | Optional | No |
| Redis/Qdrant Semantic | Milliseconds | Yes | Optional | Yes |
| Object Store (S3/GCS/Azure) | Tens of milliseconds | Yes | Yes | No |
| Disk | Milliseconds | No | Yes | No |
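The semantic-matching column can be illustrated with a toy cosine-similarity lookup. The hand-written vectors stand in for embedding-model output, and the threshold value is an illustrative default, not a recommendation:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_lookup(query_vec: list, cache: list, threshold: float = 0.9):
    """Return the cached response whose embedding is most similar to the
    query, but only if similarity clears the configured threshold."""
    best = max(cache, key=lambda entry: cosine(query_vec, entry["vec"]),
               default=None)
    if best is not None and cosine(query_vec, best["vec"]) >= threshold:
        return best["response"]
    return None
```

The threshold is the knob that trades hit rate against correctness: lower values match looser paraphrases but risk returning a response cached for a genuinely different question.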