Heuristic: KServe Prefix Cache Consistency
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Optimization |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
`PYTHONHASHSEED`, `--block-size`, and the prefix-caching hash algorithm must be identical across all vLLM pods and the LLM scheduler for prefix cache routing to work correctly.
Description
Prefix caching allows vLLM to reuse KV cache blocks for repeated prompt prefixes, dramatically reducing time-to-first-token for shared system prompts. The scheduler routes requests to pods that already have the relevant prefix cached. This requires that the hash function used to identify prefix blocks produces identical results across all components.
Usage
Use this heuristic when configuring prefix cache-aware routing in LLMInferenceService deployments with the `precise-prefix-cache-scorer` scheduler plugin.
The Insight (Rule of Thumb)
- Action: Ensure three parameters are identical across ALL vLLM pods AND the LLM scheduler:
- `PYTHONHASHSEED` environment variable (e.g., "42")
- `--block-size` vLLM argument (e.g., 64)
- `--prefix-caching-hash-algo` (e.g., sha256_cbor_64bit)
- Value: Default block size is 16 tokens; use 64 for longer sequences with common prefixes.
- Trade-off: Larger blocks mean fewer cache entries to track but less granular prefix matching.
- Failure mode: A mismatch in ANY of the three parameters causes complete cache misses, even for identical prefixes.
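To make the granularity trade-off concrete, here is a minimal sketch (the `reusable_tokens` helper is hypothetical, not part of vLLM or KServe): only whole blocks can be matched against the cache, so a larger block size leaves more of a shared prefix unmatched.

```python
def reusable_tokens(shared_prefix_len: int, block_size: int) -> int:
    """Tokens reusable from the prefix cache: only complete blocks match."""
    return (shared_prefix_len // block_size) * block_size

# For a 100-token shared system prompt:
print(reusable_tokens(100, 16))  # 96 tokens reusable (6 full blocks)
print(reusable_tokens(100, 64))  # 64 tokens reusable (1 full block)
```

With `block-size 64`, a third of this prompt falls outside any matchable block; the benefit of 64 shows up when shared prefixes are long, since fewer, larger blocks reduce hashing and lookup overhead.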
Reasoning
The prefix cache hash is computed by hashing token blocks of a fixed size using Python's hash function (seeded by `PYTHONHASHSEED`). If any component uses a different seed, block size, or hash algorithm, the hash values will not match, and the scheduler cannot correctly identify which pod has a relevant cached prefix.
Evidence from `docs/samples/llmisvc/precise-prefix-kv-cache-routing/llm-inference-service-qwen2-7b-gpu-kv-cache-routing.yaml`:
```yaml
# On vLLM pods:
env:
  - name: PYTHONHASHSEED
    value: "42"
args:
  - --prefix-caching-hash-algo=sha256_cbor_64bit
  - --block-size=64

# On scheduler:
plugins:
  - type: precise-prefix-cache-scorer
    parameters:
      indexerConfig:
        tokenProcessorConfig:
          blockSize: 64     # must match vLLM --block-size
          hashSeed: "42"    # must match PYTHONHASHSEED
```