Heuristic: KServe Prefix Cache Consistency
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Optimization |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
`PYTHONHASHSEED`, `--block-size`, and the prefix-caching hash algorithm must be identical across all vLLM pods and the LLM scheduler for prefix cache routing to work correctly.
Description
Prefix caching allows vLLM to reuse KV cache blocks for repeated prompt prefixes, dramatically reducing time-to-first-token for shared system prompts. The scheduler routes requests to pods that already have the relevant prefix cached. This requires that the hash function used to identify prefix blocks produces identical results across all components.
Usage
Use this heuristic when configuring prefix cache-aware routing in LLMInferenceService deployments with the `precise-prefix-cache-scorer` scheduler plugin.
The Insight (Rule of Thumb)
- Action: Ensure three parameters are identical across ALL vLLM pods AND the LLM scheduler:
- `PYTHONHASHSEED` environment variable (e.g., "42")
- `--block-size` vLLM argument (e.g., 64)
- `--prefix-caching-hash-algo` (e.g., sha256_cbor_64bit)
- Value: Default block size is 16 tokens; use 64 for longer sequences with common prefixes.
- Trade-off: Larger blocks mean fewer cache entries to track but less granular prefix matching.
- Failure mode: A mismatch in ANY of the three parameters causes complete cache misses, even for identical prefixes.
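To make the granularity trade-off concrete, here is a minimal sketch (the `reusable_tokens` helper is hypothetical, not part of vLLM or KServe): only whole blocks can be matched against the cache, so a larger block size leaves more of a shared prefix unmatched.

```python
def reusable_tokens(shared_prefix_len: int, block_size: int) -> int:
    """Tokens reusable from the prefix cache: only complete blocks match."""
    return (shared_prefix_len // block_size) * block_size

# For a 100-token shared system prompt:
print(reusable_tokens(100, 16))  # 96 tokens reusable (6 full blocks)
print(reusable_tokens(100, 64))  # 64 tokens reusable (1 full block)
```

With `block-size 64`, a third of this prompt falls outside any matchable block; the benefit of 64 shows up when shared prefixes are long, since fewer, larger blocks reduce hashing and lookup overhead.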
Reasoning
The prefix cache hash is computed by hashing token blocks of a fixed size using Python's hash function (seeded by `PYTHONHASHSEED`). If any component uses a different seed, block size, or hash algorithm, the hash values will not match, and the scheduler cannot correctly identify which pod has a relevant cached prefix.
Evidence from `docs/samples/llmisvc/precise-prefix-kv-cache-routing/llm-inference-service-qwen2-7b-gpu-kv-cache-routing.yaml`:
```yaml
# On vLLM pods:
env:
  - name: PYTHONHASHSEED
    value: "42"
args:
  - --prefix-caching-hash-algo=sha256_cbor_64bit
  - --block-size=64

# On scheduler:
plugins:
  - type: precise-prefix-cache-scorer
    parameters:
      indexerConfig:
        tokenProcessorConfig:
          blockSize: 64     # must match vLLM --block-size
          hashSeed: "42"    # must match PYTHONHASHSEED
```