Implementation:LMCache LMCache Prefix Analysis
| Knowledge Sources | |
|---|---|
| Domains | KV Cache Analysis, Performance Benchmarking |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
This module analyzes prefix cache hit rates across different pool sizes using both prefix matching and substring matching strategies on tokenized request traces.
Description
The prefix_analysis.py script is a standalone command-line tool in the LMCache examples that simulates KV cache pool behavior with LRU eviction. It loads request traces from JSONL files, tokenizes them using HuggingFace tokenizers, and computes cache hit rates for varying pool sizes using two methods: prefix matching and substring matching. The results are visualized as a comparison plot saved as a PNG file.
Usage
Use this script to evaluate how effective prefix-based and substring-based KV cache sharing strategies are for a given workload trace. It helps in capacity planning by showing hit rates across pool sizes ranging from 1 GB to unlimited.
Code Reference
Source Location
- Repository: LMCache
- File: examples/agents/prefix_analysis.py
- Lines: 1-471
Signature
class LRUTokenPool:
def __init__(self, max_tokens: float) -> None: ...
def longest_prefix_len(self, tokens: List[int]) -> Tuple[int, int]: ...
def longest_common_substring(
self, request_id: int, token_tensor: torch.Tensor, tokens: List[int],
*, chunk_len: int = 4, stride_r: int = 4, chunk_batch: int = 512,
) -> Tuple[int, float]: ...
def add_request(
self, request_id: int, tokens: List[int],
token_tensor: Optional[torch.Tensor] = None,
) -> None: ...
def load_and_tokenize_inputs(
jsonl_path: str, tokenizer_name: str = DEFAULT_TOKENIZER
) -> Tuple[List[List[int]], torch.Tensor]: ...
def calculate_hit_rate(
token_sequences: List[List[int]], pool_size: Optional[int] = None,
token_tensor: Optional[torch.Tensor] = None, method: str = "prefix",
) -> float: ...
def analyze_hit_rates_across_pool_sizes(
token_sequences: List[List[int]], pool_sizes_gb: List[Union[int, float, str]],
tokens_per_gb: int, token_tensor: Optional[torch.Tensor] = None,
) -> Tuple[List[float], List[float], List[str]]: ...
def plot_hit_rates(
prefix_hit_rates: List[float], substring_hit_rates: List[float],
x_labels: List[str], output_path: str,
) -> None: ...
def main() -> None: ...
Import
# This is a standalone script; run directly:
# python examples/agents/prefix_analysis.py -i trace.jsonl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -i / --input | str | Yes | Path to the input JSONL file containing request traces with an "input" field per line |
| -o / --output | str | No | Output path for the PNG plot (default: prefix_cache_hit_rate.png) |
| --tokenizer | str | No | HuggingFace tokenizer model name (default: meta-llama/Llama-3.1-8B) |
| --tokens-per-gb | int | No | Conversion factor from GB to tokens (default: 8200) |
| --pool-sizes | List[str] | No | Pool sizes in GB to test, space-separated, may include "unlimited" |
Outputs
| Name | Type | Description |
|---|---|---|
| PNG plot file | File | A comparison plot of prefix vs substring hit rates across pool sizes |
| Console output | stdout | Detailed hit rate statistics for each pool size and method |
Usage Examples
# Run from the command line:
# python examples/agents/prefix_analysis.py -i trace.jsonl -o results.png
# With custom pool sizes:
# python examples/agents/prefix_analysis.py -i trace.jsonl --pool-sizes 1 4 16 unlimited
# Programmatic usage of the LRUTokenPool:
from examples.agents.prefix_analysis import LRUTokenPool
pool = LRUTokenPool(max_tokens=100000)
pool.add_request(0, [1, 2, 3, 4, 5])
prefix_len, best_id = pool.longest_prefix_len([1, 2, 3, 6, 7])
# prefix_len == 3, matching the first 3 tokens