Heuristic: LMCache Memory Fragmentation Budget
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Memory safety heuristic that reserves 50% of the chunk budget for concurrent operations, based on the proven upper bound that fragmentation cannot exceed 50% when all chunks are uniformly sized.
Description
The `WeightedSemaphore` in the `StorageManager` allocates only half of the total chunk budget for concurrent retrieval operations. This design is rooted in the observation that when `save_unfull_chunk=False` (i.e., all stored chunks are the same size), memory fragmentation physically cannot exceed 50%. The remaining 50% acts as a safety margin for fragmentation and allocation overhead. When an operation requires more than the concurrent budget (oversized request), the semaphore switches to exclusive access mode, waiting until all concurrent slots are freed before proceeding.
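The acquire-side behavior described above can be sketched as follows. This is a minimal illustrative model, not LMCache's actual implementation; the class and method names here are assumptions, and only the 50% cap and the exclusive-access fallback for oversized requests mirror the description.

```python
import asyncio


class WeightedSemaphoreSketch:
    """Illustrative sketch of the 50% budgeting idea (not LMCache's code)."""

    def __init__(self, chunk_budget: int):
        # Only half the budget serves concurrent requests; the other
        # half is the fragmentation / allocation-overhead safety margin.
        self._concurrent_cap = chunk_budget // 2
        self._available = self._concurrent_cap
        self._cond = asyncio.Condition()

    async def acquire(self, num_chunks: int) -> None:
        async with self._cond:
            if num_chunks > self._concurrent_cap:
                # Oversized request: exclusive access. Wait until every
                # concurrent slot has been returned, then claim them all.
                await self._cond.wait_for(
                    lambda: self._available == self._concurrent_cap
                )
                self._available = 0
            else:
                await self._cond.wait_for(
                    lambda: self._available >= num_chunks
                )
                self._available -= num_chunks

    async def release(self, num_chunks: int) -> None:
        async with self._cond:
            # Cap at the concurrent budget so an oversized release
            # restores exactly the concurrent slots, no more.
            self._available = min(
                self._available + num_chunks, self._concurrent_cap
            )
            self._cond.notify_all()
```

With a budget of 10 chunks, the concurrent cap is 5; a request for 8 chunks takes the exclusive path and blocks all other operations until it releases.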
Usage
This heuristic applies when tuning CPU memory allocation for LMCache's local CPU backend. If you are experiencing deadlocks or allocation failures during concurrent cache retrievals, ensure that `max_local_cpu_size` is at least 2x the expected concurrent working set. The 50% rule means that with a 5GB CPU buffer (default), only ~2.5GB is available for concurrent operations at any given time.
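The sizing arithmetic above can be made concrete. The chunk size used here (256 MiB) is purely illustrative, chosen to make the numbers round; only the halving rule itself comes from the source.

```python
GB = 1024 ** 3
MiB = 1024 ** 2


def sizing(max_local_cpu_size_bytes: int, chunk_bytes: int) -> tuple[int, int]:
    """Apply the 50% rule to a CPU buffer (illustrative arithmetic only)."""
    # Total chunk budget: how many uniform chunks fit in the buffer.
    chunk_budget = max_local_cpu_size_bytes // chunk_bytes
    # 50% rule: only half of those chunks serve concurrent retrievals.
    concurrent_chunks = chunk_budget // 2
    return chunk_budget, concurrent_chunks


# Default 5 GB buffer with an assumed 256 MiB chunk size:
total, concurrent = sizing(5 * GB, 256 * MiB)
# 20 chunks total, 10 of them (~2.5 GB) concurrently usable.
```

This is why a buffer sized to the expected concurrent working set alone is too small: only half of it is ever handed out concurrently.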
The Insight (Rule of Thumb)
- Action: The system automatically caps concurrent chunk allocation at 50% of the total budget via `WeightedSemaphore`.
- Value: `concurrent_budget = chunk_budget // 2`
- Trade-off: Halves effective concurrent throughput in exchange for guaranteed deadlock-free operation. Oversized requests (larger than 50% of budget) require exclusive access, blocking all other operations.
- Prerequisite: Assumes `save_unfull_chunk=False` so all chunks are uniformly sized. If partial chunks are saved, the 50% bound no longer holds.
Reasoning
When all chunks have the same size and the allocator uses a contiguous memory pool, worst-case fragmentation is bounded at 50%: each live chunk can strand at most one chunk-sized hole, so wasted space can never exceed used space. The `AsyncMultiSerializer` explicitly documents this assumption: "Make the assumption that the save_unfull_chunk is False so that we can assume that we can always use 50% of the given memory." The oversized request path (lines 141-147) handles the edge case where a single request needs more than 50% of the budget by requiring exclusive access.
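The bound follows from simple counting. If each live chunk can strand at most one chunk-sized hole, then stranded holes never outnumber live chunks, and the wasted fraction peaks at exactly one half. A small check of this model (the function name is ours, for illustration):

```python
def fragmentation(used_chunks: int, stranded_holes: int) -> float:
    """Wasted fraction of the pool under the uniform-chunk model."""
    # With uniform chunks, each live chunk can strand at most one
    # chunk-sized hole, so stranded_holes <= used_chunks always holds.
    assert stranded_holes <= used_chunks
    return stranded_holes / (used_chunks + stranded_holes)


# Sweep every legal number of holes for 10 live chunks:
worst = max(fragmentation(10, h) for h in range(11))
# The worst case is exactly 50%, reached when every chunk strands a hole.
```

If partial chunks were saved (`save_unfull_chunk=True`), hole sizes would no longer match chunk sizes and this counting argument would break down, which is why the prerequisite above matters.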
Code Evidence
WeightedSemaphore implementation from `lmcache/v1/storage_backend/storage_manager.py:118-125`:
```python
class WeightedSemaphore:
    def __init__(self, chunk_budget: int):
        # it is physically impossible to have more fragmentation than 50%
        # when all of the chunks are of the same size (save_unfull_chunk=False)
        # so we can safely allocate half of the chunk budget for concurrent requests
        self._concurrent_budget_cap = chunk_budget // 2
        self._chunk_budget_cap = chunk_budget
        self._current_chunks = self._concurrent_budget_cap
```
AsyncMultiSerializer assumption from `lmcache/v1/storage_backend/storage_manager.py:162-168`:
```python
class AsyncMultiSerializer:
    """
    Prevent race conditions where multiple batched_get's cause the local CPU
    backend to allocate memory objects in parallel and get deadlocked.
    Make the assumption that the save_unfull_chunk is False so that we
    can assume that we can always use 50% of the given memory
    """
```