Implementation:LMCache LMCache Prefix Analysis

Knowledge Sources	LMCache
Domains	KV Cache Analysis, Performance Benchmarking
Last Updated	2026-02-09 00:00 GMT

Overview

This module analyzes prefix cache hit rates across different pool sizes using both prefix matching and substring matching strategies on tokenized request traces.

Description

The prefix_analysis.py script is a standalone command-line tool in the LMCache examples that simulates KV cache pool behavior with LRU eviction. It loads request traces from JSONL files, tokenizes them using HuggingFace tokenizers, and computes cache hit rates for varying pool sizes using two methods: prefix matching and substring matching. The results are visualized as a comparison plot saved as a PNG file.

Usage

Use this script to evaluate how effective prefix-based and substring-based KV cache sharing strategies are for a given workload trace. It helps in capacity planning by showing hit rates across pool sizes ranging from 1 GB to unlimited.

Code Reference

Source Location

Repository: LMCache
File: examples/agents/prefix_analysis.py
Lines: 1-471

Signature

class LRUTokenPool:
    def __init__(self, max_tokens: float) -> None: ...
    def longest_prefix_len(self, tokens: List[int]) -> Tuple[int, int]: ...
    def longest_common_substring(
        self, request_id: int, token_tensor: torch.Tensor, tokens: List[int],
        *, chunk_len: int = 4, stride_r: int = 4, chunk_batch: int = 512,
    ) -> Tuple[int, float]: ...
    def add_request(
        self, request_id: int, tokens: List[int],
        token_tensor: Optional[torch.Tensor] = None,
    ) -> None: ...

def load_and_tokenize_inputs(
    jsonl_path: str, tokenizer_name: str = DEFAULT_TOKENIZER
) -> Tuple[List[List[int]], torch.Tensor]: ...

def calculate_hit_rate(
    token_sequences: List[List[int]], pool_size: Optional[int] = None,
    token_tensor: Optional[torch.Tensor] = None, method: str = "prefix",
) -> float: ...

def analyze_hit_rates_across_pool_sizes(
    token_sequences: List[List[int]], pool_sizes_gb: List[Union[int, float, str]],
    tokens_per_gb: int, token_tensor: Optional[torch.Tensor] = None,
) -> Tuple[List[float], List[float], List[str]]: ...

def plot_hit_rates(
    prefix_hit_rates: List[float], substring_hit_rates: List[float],
    x_labels: List[str], output_path: str,
) -> None: ...

def main() -> None: ...

Import

# This is a standalone script; run directly:
# python examples/agents/prefix_analysis.py -i trace.jsonl

I/O Contract

Inputs

Name	Type	Required	Description
-i / --input	str	Yes	Path to the input JSONL file containing request traces with an "input" field per line
-o / --output	str	No	Output path for the PNG plot (default: prefix_cache_hit_rate.png)
--tokenizer	str	No	HuggingFace tokenizer model name (default: meta-llama/Llama-3.1-8B)
--tokens-per-gb	int	No	Conversion factor from GB to tokens (default: 8200)
--pool-sizes	List[str]	No	Pool sizes in GB to test, space-separated, may include "unlimited"

Outputs

Name	Type	Description
PNG plot file	File	A comparison plot of prefix vs substring hit rates across pool sizes
Console output	stdout	Detailed hit rate statistics for each pool size and method

Usage Examples

# Run from the command line:
# python examples/agents/prefix_analysis.py -i trace.jsonl -o results.png

# With custom pool sizes:
# python examples/agents/prefix_analysis.py -i trace.jsonl --pool-sizes 1 4 16 unlimited

# Programmatic usage of the LRUTokenPool:
from examples.agents.prefix_analysis import LRUTokenPool

pool = LRUTokenPool(max_tokens=100000)
pool.add_request(0, [1, 2, 3, 4, 5])
prefix_len, best_id = pool.longest_prefix_len([1, 2, 3, 6, 7])
# prefix_len == 3, matching the first 3 tokens

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment