Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:LMCache LMCache Prefix Analysis

From Leeroopedia


Knowledge Sources
Domains KV Cache Analysis, Performance Benchmarking
Last Updated 2026-02-09 00:00 GMT

Overview

This module analyzes prefix cache hit rates across different pool sizes using both prefix matching and substring matching strategies on tokenized request traces.

Description

The prefix_analysis.py script is a standalone command-line tool in the LMCache examples that simulates KV cache pool behavior with LRU eviction. It loads request traces from JSONL files, tokenizes them using HuggingFace tokenizers, and computes cache hit rates for varying pool sizes using two methods: prefix matching and substring matching. The results are visualized as a comparison plot saved as a PNG file.

Usage

Use this script to evaluate how effective prefix-based and substring-based KV cache sharing strategies are for a given workload trace. It helps in capacity planning by showing hit rates across pool sizes ranging from 1 GB to unlimited.

Code Reference

Source Location

Signature

class LRUTokenPool:
    def __init__(self, max_tokens: float) -> None: ...
    def longest_prefix_len(self, tokens: List[int]) -> Tuple[int, int]: ...
    def longest_common_substring(
        self, request_id: int, token_tensor: torch.Tensor, tokens: List[int],
        *, chunk_len: int = 4, stride_r: int = 4, chunk_batch: int = 512,
    ) -> Tuple[int, float]: ...
    def add_request(
        self, request_id: int, tokens: List[int],
        token_tensor: Optional[torch.Tensor] = None,
    ) -> None: ...

def load_and_tokenize_inputs(
    jsonl_path: str, tokenizer_name: str = DEFAULT_TOKENIZER
) -> Tuple[List[List[int]], torch.Tensor]: ...

def calculate_hit_rate(
    token_sequences: List[List[int]], pool_size: Optional[int] = None,
    token_tensor: Optional[torch.Tensor] = None, method: str = "prefix",
) -> float: ...

def analyze_hit_rates_across_pool_sizes(
    token_sequences: List[List[int]], pool_sizes_gb: List[Union[int, float, str]],
    tokens_per_gb: int, token_tensor: Optional[torch.Tensor] = None,
) -> Tuple[List[float], List[float], List[str]]: ...

def plot_hit_rates(
    prefix_hit_rates: List[float], substring_hit_rates: List[float],
    x_labels: List[str], output_path: str,
) -> None: ...

def main() -> None: ...

Import

# This is a standalone script; run directly:
# python examples/agents/prefix_analysis.py -i trace.jsonl

I/O Contract

Inputs

Name Type Required Description
-i / --input str Yes Path to the input JSONL file containing request traces with an "input" field per line
-o / --output str No Output path for the PNG plot (default: prefix_cache_hit_rate.png)
--tokenizer str No HuggingFace tokenizer model name (default: meta-llama/Llama-3.1-8B)
--tokens-per-gb int No Conversion factor from GB to tokens (default: 8200)
--pool-sizes List[str] No Pool sizes in GB to test, space-separated, may include "unlimited"

Outputs

Name Type Description
PNG plot file File A comparison plot of prefix vs substring hit rates across pool sizes
Console output stdout Detailed hit rate statistics for each pool size and method

Usage Examples

# Run from the command line:
# python examples/agents/prefix_analysis.py -i trace.jsonl -o results.png

# With custom pool sizes:
# python examples/agents/prefix_analysis.py -i trace.jsonl --pool-sizes 1 4 16 unlimited

# Programmatic usage of the LRUTokenPool:
from examples.agents.prefix_analysis import LRUTokenPool

pool = LRUTokenPool(max_tokens=100000)
pool.add_request(0, [1, 2, 3, 4, 5])
prefix_len, best_id = pool.longest_prefix_len([1, 2, 3, 6, 7])
# prefix_len == 3, matching the first 3 tokens

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment