Heuristic: FMInference FlexLLMGen Pin Memory Tradeoffs
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Inference |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Pinned memory is enabled by default for CPU weight tensors to accelerate GPU transfers, but is explicitly disabled for KV cache tensors due to high memory overhead.
Description
PyTorch's pinned (page-locked) memory enables faster CPU-to-GPU data transfers via DMA, bypassing the normal virtual memory paging system. FlexLLMGen makes different pinning decisions for different tensor types: weights on CPU are pinned by default (`--pin-weight True`) because they are large and transferred frequently during the block schedule. However, KV cache tensors on CPU have pinned memory explicitly disabled, as noted in code comments: "disable pin_memory due to high memory overhead." This is because pinned pages cannot be swapped to disk: the entire allocation stays locked in physical RAM for as long as it lives, and for tensors as large as the KV cache that resident-memory overhead is prohibitive.
Usage
Use this heuristic when tuning memory usage for CPU-offloaded inference. If CPU memory is tight, disable weight pinning with `--pin-weight 0` to save roughly 20% or more of CPU memory. If you observe slow CPU-to-GPU transfers, make sure weight pinning is enabled (the default).
The Insight (Rule of Thumb)
- Action (weights): Keep `--pin-weight True` (default) for maximum transfer speed. Set `--pin-weight 0` when CPU memory is scarce, accepting ~20% slower transfers.
- Action (KV cache): Never pin KV cache on CPU. The codebase explicitly sets `pin_memory = False` for cache allocation.
- Action (general CPU tensors): The `TorchDevice.allocate` method defaults to `pin_memory=True` for CPU tensors unless overridden.
- Trade-off: Pinned memory provides faster DMA transfers but locks physical pages in RAM, preventing swapping and increasing resident memory usage.
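The default-resolution rule in the third bullet can be sketched in isolation. This is a minimal re-creation for illustration; `resolve_pin_memory` and the standalone `DeviceType` enum below are not FlexLLMGen APIs:

```python
from enum import Enum, auto

class DeviceType(Enum):
    CPU = auto()
    CUDA = auto()

def resolve_pin_memory(device_type, pin_memory=None):
    """Mirror of the allocate() default: pin CPU tensors unless the caller
    overrides, and never pin GPU tensors (pinning applies only to host RAM)."""
    if device_type == DeviceType.CPU:
        return True if pin_memory is None else pin_memory
    return False

print(resolve_pin_memory(DeviceType.CPU))         # True  (weights: default)
print(resolve_pin_memory(DeviceType.CPU, False))  # False (KV cache: explicit)
print(resolve_pin_memory(DeviceType.CUDA))        # False (GPU tensors)
```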
Reasoning
KV cache tensors are allocated per-layer, per-batch with shape `(prompt_len + gen_len - 1, gpu_batch_size * num_head, head_dim)`. For a model with many layers and large batch sizes, these tensors can consume tens of gigabytes. Pinning all of them would lock that entire footprint into physical RAM because:
- The pinned allocation locks physical RAM pages.
- The OS cannot swap pinned pages, so they compete with other allocations.
- For large models like OPT-175B, this would require hundreds of gigabytes of non-swappable RAM.
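To see the scale involved, a back-of-envelope calculation using the cache shape given above. The model settings here are OPT-30B-like assumptions chosen for illustration, not values from the source; fp16 (2 bytes) with one K and one V tensor per layer:

```python
def kv_cache_bytes(prompt_len, gen_len, gpu_batch_size, num_head, head_dim,
                   num_layers, bytes_per_elem=2):
    """Total CPU KV cache size: one K and one V tensor per layer, each of
    shape (prompt_len + gen_len - 1, gpu_batch_size * num_head, head_dim)."""
    seq = prompt_len + gen_len - 1
    per_tensor = seq * gpu_batch_size * num_head * head_dim * bytes_per_elem
    return 2 * num_layers * per_tensor  # K and V

# Assumed OPT-30B-like configuration: 48 layers, 56 heads, head_dim 128.
gb = kv_cache_bytes(prompt_len=512, gen_len=32, gpu_batch_size=64,
                    num_head=56, head_dim=128, num_layers=48) / 2**30
print(f"{gb:.1f} GiB")  # -> 44.5 GiB
```

Locking ~45 GiB of non-swappable RAM for the cache alone, on top of offloaded weights, is why the allocation path disables pinning.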
In contrast, weights are pinned by default because the block schedule reuses each weight tensor across all micro-batches in a layer. The transfer speed gain from pinning is amortized across many uses, making the memory cost worthwhile.
When CPU memory is insufficient, unpinning weights (`--pin-weight 0`) triggers a relay mechanism in the copy code: non-pinned CPU tensors are first copied to a small pinned relay buffer before being sent to the GPU. This is slower but uses far less memory.
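The relay path can be illustrated with a toy model of the copy dispatch. `HostTensor` and `to_gpu` below are pure-Python stand-ins, not FlexLLMGen APIs; the real code calls `torch.Tensor.pin_memory()` before the CUDA copy, as shown in Code Evidence:

```python
class HostTensor:
    """Toy stand-in for a CPU tensor that may or may not be page-locked."""
    def __init__(self, data, pinned=False):
        self.data = data
        self.pinned = pinned

    def pin_memory(self):
        # Like torch.Tensor.pin_memory(): returns a *pinned copy*.
        # A short-lived copy is the price the relay pays instead of
        # keeping the whole tensor permanently pinned.
        return HostTensor(list(self.data), pinned=True)

def to_gpu(src, steps):
    """Sketch of the CPU->GPU copy path: unpinned tensors are relayed
    through a pinned copy so the actual transfer can still use DMA."""
    if not src.pinned:
        src = src.pin_memory()
        steps.append("relay: copy into pinned buffer")
    steps.append("dma: async copy to GPU")
    return src

steps = []
to_gpu(HostTensor([1.0, 2.0]), steps)                # unpinned -> relay + DMA
assert steps == ["relay: copy into pinned buffer", "dma: async copy to GPU"]

steps = []
to_gpu(HostTensor([1.0, 2.0], pinned=True), steps)   # pinned -> direct DMA
assert steps == ["dma: async copy to GPU"]
```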
Code Evidence
KV cache `pin_memory` disabled in `flexllmgen/pytorch_backend.py:292-293`:

```python
# NOTE: disable pin_memory due to high memory overhead
pin_memory = False
```
Same pattern for compressed cache in `flexllmgen/compression.py:56-57`:

```python
# NOTE: disable pin_memory due to high memory overhead
pin_memory = False
```
Default CPU `pin_memory` behavior in `flexllmgen/pytorch_backend.py:184-190`:

```python
def allocate(self, shape, dtype, pin_memory=None, name=None):
    if self.device_type == DeviceType.CPU:
        pin_memory = True if pin_memory is None else pin_memory
    else:
        pin_memory = False
    dtype = np_dtype_to_torch_dtype[dtype]
    data = torch.empty(shape, dtype=dtype, pin_memory=pin_memory, device=self.dev)
```
Non-pinned relay mechanism in `flexllmgen/pytorch_backend.py:845-849`:

```python
elif (src.device.device_type == DeviceType.CPU and
      dst.device.device_type == DeviceType.CUDA and
      not src.data.is_pinned()):
    # The cpu tensor is not pinned, use pin_memory as a relay
    src = src.pin_memory()
```