Workflow:LMCache LMCache KV Cache Offloading

From Leeroopedia


Knowledge Sources
Domains LLM_Serving, KV_Cache, Inference_Optimization
Last Updated 2026-02-09 00:00 GMT

Overview

End-to-end process for offloading KV caches from GPU memory to CPU RAM or local disk within a single vLLM instance to reduce Time-To-First-Token (TTFT) for requests with shared prefixes.

Description

This workflow demonstrates the fundamental LMCache use case: KV cache reuse within a single vLLM serving instance. When multiple requests share common prompt prefixes (e.g., system prompts, document contexts in RAG), LMCache stores the computed KV cache tensors to CPU memory or local disk after the first request. Subsequent requests with the same prefix retrieve the cached tensors instead of recomputing them, which cuts TTFT and the GPU compute spent on prefill. The process covers environment configuration, vLLM engine initialization with the LMCache connector, prompt construction with shared prefixes, and cache-accelerated generation.

Usage

Execute this workflow when you have a vLLM-based LLM serving deployment and want to reduce TTFT for workloads with repeated prompt prefixes, such as multi-round question-answering, RAG pipelines with shared document contexts, or chat sessions with long system prompts. This is the starting point for all LMCache deployments and requires only a single GPU.

Execution Steps

Step 1: Configure LMCache Environment

Set up LMCache parameters via environment variables or a YAML configuration file. Key parameters include the chunk size (number of tokens per cache chunk, typically 256), the storage backend selection (CPU RAM or local disk), and memory limits for each storage tier. For disk-based storage, specify a file URI path. These settings control how KV cache tensors are chunked, stored, and retrieved.

Key considerations:

  • Chunk size affects cache granularity -- smaller chunks enable finer-grained reuse but increase metadata overhead
  • CPU backend is faster but limited by available system RAM
  • Disk backend supports larger capacity but has higher latency
  • Memory limits prevent unbounded growth of cached data
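The configuration described above can be sketched as a small helper that builds the environment variables before the engine starts. The variable names (LMCACHE_CHUNK_SIZE, LMCACHE_LOCAL_CPU, LMCACHE_MAX_LOCAL_CPU_SIZE, LMCACHE_LOCAL_DISK, LMCACHE_MAX_LOCAL_DISK_SIZE) are assumed from commonly published LMCache examples and should be checked against the installed version; the helper name lmcache_env, the 5.0 GB limit, and the disk path are hypothetical.

```python
import os


def lmcache_env(backend="cpu", chunk_size=256, max_size_gb=5.0,
                disk_path="file://local_disk/"):
    """Build LMCache settings as environment-variable strings.

    Variable names are assumptions; verify them for your LMCache version.
    """
    env = {"LMCACHE_CHUNK_SIZE": str(chunk_size)}  # tokens per cache chunk
    if backend == "cpu":
        env["LMCACHE_LOCAL_CPU"] = "True"
        env["LMCACHE_MAX_LOCAL_CPU_SIZE"] = str(max_size_gb)  # GB
    elif backend == "disk":
        # Whether a CPU tier must also be sized alongside the disk tier
        # may depend on the LMCache version; this sketch leaves it unset.
        env["LMCACHE_LOCAL_DISK"] = disk_path  # file URI
        env["LMCACHE_MAX_LOCAL_DISK_SIZE"] = str(max_size_gb)  # GB
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    return env


# Apply the CPU settings before the vLLM engine is created.
os.environ.update(lmcache_env("cpu"))
```

Keeping the settings in one function makes it easy to switch tiers per deployment; the same values could instead live in a YAML configuration file.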

Step 2: Initialize vLLM with LMCache Connector

Create a vLLM LLM engine instance with the LMCache KV transfer connector enabled. This involves constructing a KVTransferConfig that specifies the LMCacheConnectorV1 as the connector class and sets the role to "kv_both" (meaning the instance both stores and retrieves KV caches). Configure model parameters including maximum sequence length and GPU memory utilization based on available hardware.

Key considerations:

  • Use LMCacheConnectorV1 for vLLM v1 (the current recommended version)
  • The "kv_both" role enables bidirectional cache operations within a single instance
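  • The GPU memory utilization fraction is the share of GPU memory vLLM reserves; it must cover both model weights and KV cache working memory while leaving headroom for anything else running on the device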
  • The connector integrates transparently with vLLM's inference pipeline
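A minimal sketch of the initialization step follows, assuming vLLM exposes LLM and vllm.config.KVTransferConfig with the kv_connector and kv_role fields. The helper names and the default values (8000 tokens, 0.8 utilization) are hypothetical and should be tuned to the hardware. The vLLM imports are deferred into build_llm so the settings can be inspected on a machine without a GPU.

```python
def kv_transfer_kwargs():
    """Connector settings named in this step."""
    return {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}


def engine_kwargs(model, max_model_len=8000, gpu_memory_utilization=0.8):
    """Model-level settings; the defaults are illustrative assumptions."""
    return {
        "model": model,
        "max_model_len": max_model_len,
        "gpu_memory_utilization": gpu_memory_utilization,
    }


def build_llm(model):
    """Create a vLLM engine with the LMCache connector enabled."""
    # Deferred imports: vLLM is only needed when the engine is built.
    from vllm import LLM
    from vllm.config import KVTransferConfig

    ktc = KVTransferConfig(**kv_transfer_kwargs())
    return LLM(kv_transfer_config=ktc, **engine_kwargs(model))
```

Calling build_llm with a model identifier returns an engine whose generate calls store and retrieve KV caches through LMCache, provided the LMCache settings were applied beforehand.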

Step 3: Process Initial Request and Populate Cache

Send the first inference request containing the shared prefix text. During generation, LMCache intercepts the computed KV cache tensors from the GPU, chunks them according to the configured chunk size, and stores them asynchronously to the configured storage backend (CPU RAM or disk). The prefix token hashes serve as cache keys for later retrieval.

Key considerations:

  • The first request runs at normal speed with no cache benefit
  • KV cache storage happens asynchronously to minimize impact on generation latency
  • Hashing over token prefixes ensures that identical token prefixes produce the same cache keys regardless of what follows them in the prompt
  • Cache population logs show the number of tokens stored for verification
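To illustrate how chunking and prefix hashing can yield stable cache keys, here is a simplified sketch. It is not LMCache's actual implementation: the function name chunk_keys, the use of SHA-256, and the choice to key only full chunks are all assumptions made for illustration.

```python
import hashlib


def chunk_keys(token_ids, chunk_size=256):
    """Return one chained hash per full chunk of token_ids.

    Each key covers the chunk and every token before it, so a key
    identifies a prefix of the prompt rather than an isolated chunk.
    """
    keys = []
    prev = b""  # hash of the preceding prefix (empty for the first chunk)
    full = len(token_ids) - len(token_ids) % chunk_size  # drop partial tail
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        payload = prev + ",".join(map(str, chunk)).encode()
        prev = hashlib.sha256(payload).digest()
        keys.append(prev.hex())
    return keys


# Two prompts that share their first 512 tokens share their first two keys.
shared = list(range(512))
keys_a = chunk_keys(shared + [1, 2, 3])
keys_b = chunk_keys(shared + [4, 5, 6, 7])
assert keys_a == keys_b
```

Because each hash is chained to the previous one, a chunk with the same tokens but a different preceding prefix gets a different key, which matches the fact that KV tensors depend on all earlier tokens.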

Step 4: Serve Subsequent Requests with Cache Hits

When new requests arrive with the same prefix, LMCache's lookup client checks the token database for matching cache entries before prefill begins. Matched prefix chunks are retrieved from storage and loaded directly into the GPU KV cache, skipping recomputation. Only the novel suffix tokens require full prefill, resulting in significantly reduced TTFT.

Key considerations:

  • Cache hits are logged showing how many tokens were retrieved versus total tokens
  • The TTFT reduction grows with the fraction of the prompt covered by cached prefix chunks; a prompt that shares only a short prefix sees little benefit
  • Cache retrieval supports layerwise operations for memory efficiency
  • Multiple requests can benefit from the same cached prefix simultaneously
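The relationship between prefix overlap and prefill work can be estimated with a little arithmetic. The sketch below assumes that reuse happens in whole chunks only, which is a simplification; the function name prefill_savings and the example token counts are hypothetical.

```python
def prefill_savings(prompt_tokens, cached_prefix_tokens, chunk_size=256):
    """Estimate how many prompt tokens can skip prefill on a cache hit."""
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    usable = min(cached_prefix_tokens, prompt_tokens)
    reused = (usable // chunk_size) * chunk_size  # whole chunks only
    return {
        "reused": reused,                      # tokens loaded from cache
        "recomputed": prompt_tokens - reused,  # tokens still prefilled
        "hit_ratio": reused / prompt_tokens,
    }


# A 4300-token prompt whose first 4000 tokens were cached earlier:
stats = prefill_savings(4300, 4000)
# 15 full chunks (3840 tokens) are reused; 460 tokens are recomputed.
```

Comparing the reused count against the retrieved-versus-total token figures in the cache hit logs is one way to check that the expected prefix is actually being matched.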

Step 5: Clean Up Resources

After serving is complete, destroy the LMCache engine instance to release CPU/disk resources and clean up any temporary storage. This ensures proper resource management in production deployments.

Key considerations:

  • Use LMCacheEngineBuilder.destroy() for explicit cleanup
  • Context managers in the example code handle cleanup automatically on exit
  • Disk-based caches can optionally be preserved across restarts for persistent caching
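The cleanup pattern can be sketched with a generic context manager that guarantees teardown even when serving raises. The helper managed_engine is hypothetical. The import paths inside lmcache_destroy, and the ENGINE_NAME argument passed to destroy, are assumptions based on published LMCache examples and may differ between versions.

```python
import contextlib


@contextlib.contextmanager
def managed_engine(build, destroy):
    """Yield an engine from build() and always call destroy() afterwards."""
    engine = build()
    try:
        yield engine
    finally:
        destroy()  # runs on normal exit and on exceptions


def lmcache_destroy():
    """Release LMCache resources (import paths are assumptions)."""
    from lmcache.integration.vllm.utils import ENGINE_NAME
    from lmcache.v1.cache_engine import LMCacheEngineBuilder

    LMCacheEngineBuilder.destroy(ENGINE_NAME)
```

In use, build would be a zero-argument callable that creates the vLLM engine and destroy would be lmcache_destroy, so that `with managed_engine(build, lmcache_destroy) as llm:` wraps the whole serving session.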
