Heuristic: LMCache Memory Fragmentation Budget
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Memory safety heuristic that reserves 50% of the chunk budget for concurrent operations, based on the proven upper bound that fragmentation cannot exceed 50% when all chunks are uniformly sized.
Description
The `WeightedSemaphore` in the `StorageManager` allocates only half of the total chunk budget for concurrent retrieval operations. This design is rooted in the observation that when `save_unfull_chunk=False` (i.e., all stored chunks are the same size), memory fragmentation physically cannot exceed 50%. The remaining 50% acts as a safety margin for fragmentation and allocation overhead. When an operation requires more than the concurrent budget (oversized request), the semaphore switches to exclusive access mode, waiting until all concurrent slots are freed before proceeding.
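The acquire-side behavior described above can be sketched as follows. This is a minimal illustrative model, not LMCache's actual implementation; the class and method names here are assumptions, and only the 50% cap and the exclusive-access fallback for oversized requests mirror the description.

```python
import asyncio


class WeightedSemaphoreSketch:
    """Illustrative sketch of the 50% budgeting idea (not LMCache's code)."""

    def __init__(self, chunk_budget: int):
        # Only half the budget serves concurrent requests; the other
        # half is the fragmentation / allocation-overhead safety margin.
        self._concurrent_cap = chunk_budget // 2
        self._available = self._concurrent_cap
        self._cond = asyncio.Condition()

    async def acquire(self, num_chunks: int) -> None:
        async with self._cond:
            if num_chunks > self._concurrent_cap:
                # Oversized request: exclusive access. Wait until every
                # concurrent slot has been returned, then claim them all.
                await self._cond.wait_for(
                    lambda: self._available == self._concurrent_cap
                )
                self._available = 0
            else:
                await self._cond.wait_for(
                    lambda: self._available >= num_chunks
                )
                self._available -= num_chunks

    async def release(self, num_chunks: int) -> None:
        async with self._cond:
            # Cap at the concurrent budget so an oversized release
            # restores exactly the concurrent slots, no more.
            self._available = min(
                self._available + num_chunks, self._concurrent_cap
            )
            self._cond.notify_all()
```

With a budget of 10 chunks, the concurrent cap is 5; a request for 8 chunks takes the exclusive path and blocks all other operations until it releases.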
Usage
This heuristic applies when tuning CPU memory allocation for LMCache's local CPU backend. If you are experiencing deadlocks or allocation failures during concurrent cache retrievals, ensure that `max_local_cpu_size` is at least 2x the expected concurrent working set. The 50% rule means that with a 5GB CPU buffer (default), only ~2.5GB is available for concurrent operations at any given time.
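The sizing arithmetic above can be made concrete. The chunk size used here (256 MiB) is purely illustrative, chosen to make the numbers round; only the halving rule itself comes from the source.

```python
GB = 1024 ** 3
MiB = 1024 ** 2


def sizing(max_local_cpu_size_bytes: int, chunk_bytes: int) -> tuple[int, int]:
    """Apply the 50% rule to a CPU buffer (illustrative arithmetic only)."""
    # Total chunk budget: how many uniform chunks fit in the buffer.
    chunk_budget = max_local_cpu_size_bytes // chunk_bytes
    # 50% rule: only half of those chunks serve concurrent retrievals.
    concurrent_chunks = chunk_budget // 2
    return chunk_budget, concurrent_chunks


# Default 5 GB buffer with an assumed 256 MiB chunk size:
total, concurrent = sizing(5 * GB, 256 * MiB)
# 20 chunks total, 10 of them (~2.5 GB) concurrently usable.
```

This is why a buffer sized to the expected concurrent working set alone is too small: only half of it is ever handed out concurrently.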
The Insight (Rule of Thumb)
- Action: The system automatically caps concurrent chunk allocation at 50% of the total budget via `WeightedSemaphore`.
- Value: `concurrent_budget = chunk_budget // 2`
- Trade-off: Halves effective concurrent throughput in exchange for guaranteed deadlock-free operation. Oversized requests (larger than 50% of budget) require exclusive access, blocking all other operations.
- Prerequisite: Assumes `save_unfull_chunk=False` so all chunks are uniformly sized. If partial chunks are saved, the 50% bound no longer holds.
Reasoning
When all chunks have the same size and the allocator uses a contiguous memory pool, worst-case fragmentation is bounded at 50%: each live chunk can strand at most one chunk-sized hole, so wasted space can never exceed used space. The `AsyncMultiSerializer` explicitly documents this assumption: "Make the assumption that the save_unfull_chunk is False so that we can assume that we can always use 50% of the given memory." The oversized request path (lines 141-147) handles the edge case where a single request needs more than 50% of the budget by requiring exclusive access.
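The bound follows from simple counting. If each live chunk can strand at most one chunk-sized hole, then stranded holes never outnumber live chunks, and the wasted fraction peaks at exactly one half. A small check of this model (the function name is ours, for illustration):

```python
def fragmentation(used_chunks: int, stranded_holes: int) -> float:
    """Wasted fraction of the pool under the uniform-chunk model."""
    # With uniform chunks, each live chunk can strand at most one
    # chunk-sized hole, so stranded_holes <= used_chunks always holds.
    assert stranded_holes <= used_chunks
    return stranded_holes / (used_chunks + stranded_holes)


# Sweep every legal number of holes for 10 live chunks:
worst = max(fragmentation(10, h) for h in range(11))
# The worst case is exactly 50%, reached when every chunk strands a hole.
```

If partial chunks were saved (`save_unfull_chunk=True`), hole sizes would no longer match chunk sizes and this counting argument would break down, which is why the prerequisite above matters.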
Code Evidence
WeightedSemaphore implementation from `lmcache/v1/storage_backend/storage_manager.py:118-125`:
```python
class WeightedSemaphore:
    def __init__(self, chunk_budget: int):
        # it is physically impossible to have more fragmentation than 50%
        # when all of the chunks are of the same size (save_unfull_chunk=False)
        # so we can safely allocate half of the chunk budget for concurrent requests
        self._concurrent_budget_cap = chunk_budget // 2
        self._chunk_budget_cap = chunk_budget
        self._current_chunks = self._concurrent_budget_cap
```
AsyncMultiSerializer assumption from `lmcache/v1/storage_backend/storage_manager.py:162-168`:
```python
class AsyncMultiSerializer:
    """
    Prevent race conditions where multiple batched_get's cause the local CPU
    backend to allocate memory objects in parallel and get deadlocked.
    Make the assumption that the save_unfull_chunk is False so that we
    can assume that we can always use 50% of the given memory
    """
```