Heuristic: FMInference FlexLLMGen Pin Memory Tradeoffs
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Inference |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Pinned memory is enabled by default for CPU weight tensors to accelerate GPU transfers, but is explicitly disabled for KV cache tensors due to high memory overhead.
Description
PyTorch's pinned (page-locked) memory enables faster CPU-to-GPU data transfers via DMA, bypassing the normal virtual memory paging system. FlexLLMGen makes different pinning decisions for different tensor types: weights on CPU are pinned by default (`--pin-weight True`) because they are large and transferred frequently during the block schedule. However, KV cache tensors on CPU have pinned memory explicitly disabled, as noted in code comments: "disable pin_memory due to high memory overhead." This is because pinned pages cannot be swapped to disk: the entire allocation stays locked in physical RAM for as long as it lives, and for tensors as large as the KV cache that resident-memory overhead is prohibitive.
Usage
Use this heuristic when tuning memory usage for CPU-offloaded inference. If CPU memory is tight, disable weight pinning with `--pin-weight 0` to save roughly 20% or more of CPU memory. If you observe slow CPU-to-GPU transfers, make sure weight pinning is enabled (the default).
The Insight (Rule of Thumb)
- Action (weights): Keep `--pin-weight True` (default) for maximum transfer speed. Set `--pin-weight 0` when CPU memory is scarce, accepting ~20% slower transfers.
- Action (KV cache): Never pin KV cache on CPU. The codebase explicitly sets `pin_memory = False` for cache allocation.
- Action (general CPU tensors): The `TorchDevice.allocate` method defaults to `pin_memory=True` for CPU tensors unless overridden.
- Trade-off: Pinned memory provides faster DMA transfers but locks physical pages in RAM, preventing swapping and increasing resident memory usage.
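The default-resolution rule in the third bullet can be sketched in isolation. This is a minimal re-creation for illustration; `resolve_pin_memory` and the standalone `DeviceType` enum below are not FlexLLMGen APIs:

```python
from enum import Enum, auto

class DeviceType(Enum):
    CPU = auto()
    CUDA = auto()

def resolve_pin_memory(device_type, pin_memory=None):
    """Mirror of the allocate() default: pin CPU tensors unless the caller
    overrides, and never pin GPU tensors (pinning applies only to host RAM)."""
    if device_type == DeviceType.CPU:
        return True if pin_memory is None else pin_memory
    return False

print(resolve_pin_memory(DeviceType.CPU))         # True  (weights: default)
print(resolve_pin_memory(DeviceType.CPU, False))  # False (KV cache: explicit)
print(resolve_pin_memory(DeviceType.CUDA))        # False (GPU tensors)
```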
Reasoning
KV cache tensors are allocated per-layer, per-batch with shape `(prompt_len + gen_len - 1, gpu_batch_size * num_head, head_dim)`. For a model with many layers and large batch sizes, these tensors can consume tens of gigabytes. Pinning all of them would lock that entire footprint into physical RAM because:
- The pinned allocation locks physical RAM pages.
- The OS cannot swap pinned pages, so they compete with other allocations.
- For large models like OPT-175B, this would require hundreds of gigabytes of non-swappable RAM.
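To see the scale involved, a back-of-envelope calculation using the cache shape given above. The model settings here are OPT-30B-like assumptions chosen for illustration, not values from the source; fp16 (2 bytes) with one K and one V tensor per layer:

```python
def kv_cache_bytes(prompt_len, gen_len, gpu_batch_size, num_head, head_dim,
                   num_layers, bytes_per_elem=2):
    """Total CPU KV cache size: one K and one V tensor per layer, each of
    shape (prompt_len + gen_len - 1, gpu_batch_size * num_head, head_dim)."""
    seq = prompt_len + gen_len - 1
    per_tensor = seq * gpu_batch_size * num_head * head_dim * bytes_per_elem
    return 2 * num_layers * per_tensor  # K and V

# Assumed OPT-30B-like configuration: 48 layers, 56 heads, head_dim 128.
gb = kv_cache_bytes(prompt_len=512, gen_len=32, gpu_batch_size=64,
                    num_head=56, head_dim=128, num_layers=48) / 2**30
print(f"{gb:.1f} GiB")  # -> 44.5 GiB
```

Locking ~45 GiB of non-swappable RAM for the cache alone, on top of offloaded weights, is why the allocation path disables pinning.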
In contrast, weights are pinned by default because the block schedule reuses each weight tensor across all micro-batches in a layer. The transfer speed gain from pinning is amortized across many uses, making the memory cost worthwhile.
When CPU memory is insufficient, unpinning weights (`--pin-weight 0`) triggers a relay mechanism in the copy code: non-pinned CPU tensors are first copied to a small pinned relay buffer before being sent to the GPU. This is slower but uses far less memory.
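The relay path can be illustrated with a toy model of the copy dispatch. `HostTensor` and `to_gpu` below are pure-Python stand-ins, not FlexLLMGen APIs; the real code calls `torch.Tensor.pin_memory()` before the CUDA copy, as shown in Code Evidence:

```python
class HostTensor:
    """Toy stand-in for a CPU tensor that may or may not be page-locked."""
    def __init__(self, data, pinned=False):
        self.data = data
        self.pinned = pinned

    def pin_memory(self):
        # Like torch.Tensor.pin_memory(): returns a *pinned copy*.
        # A short-lived copy is the price the relay pays instead of
        # keeping the whole tensor permanently pinned.
        return HostTensor(list(self.data), pinned=True)

def to_gpu(src, steps):
    """Sketch of the CPU->GPU copy path: unpinned tensors are relayed
    through a pinned copy so the actual transfer can still use DMA."""
    if not src.pinned:
        src = src.pin_memory()
        steps.append("relay: copy into pinned buffer")
    steps.append("dma: async copy to GPU")
    return src

steps = []
to_gpu(HostTensor([1.0, 2.0]), steps)                # unpinned -> relay + DMA
assert steps == ["relay: copy into pinned buffer", "dma: async copy to GPU"]

steps = []
to_gpu(HostTensor([1.0, 2.0], pinned=True), steps)   # pinned -> direct DMA
assert steps == ["dma: async copy to GPU"]
```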
Code Evidence
KV cache `pin_memory` disabled in `flexllmgen/pytorch_backend.py:292-293`:

```python
# NOTE: disable pin_memory due to high memory overhead
pin_memory = False
```
Same pattern for compressed cache in `flexllmgen/compression.py:56-57`:

```python
# NOTE: disable pin_memory due to high memory overhead
pin_memory = False
```
Default CPU `pin_memory` behavior in `flexllmgen/pytorch_backend.py:184-190`:

```python
def allocate(self, shape, dtype, pin_memory=None, name=None):
    if self.device_type == DeviceType.CPU:
        pin_memory = True if pin_memory is None else pin_memory
    else:
        pin_memory = False
    dtype = np_dtype_to_torch_dtype[dtype]
    data = torch.empty(shape, dtype=dtype, pin_memory=pin_memory, device=self.dev)
```
Non-pinned relay mechanism in `flexllmgen/pytorch_backend.py:845-849`:

```python
elif (src.device.device_type == DeviceType.CPU and
      dst.device.device_type == DeviceType.CUDA and
      not src.data.is_pinned()):
    # The cpu tensor is not pinned, use pin_memory as a relay
    src = src.pin_memory()
```