Heuristic: mit-han-lab/llm-awq GPU Memory Management Patterns

From Leeroopedia
Knowledge Sources
Domains Optimization, Infrastructure
Last Updated 2026-02-15 01:00 GMT

Overview

GPU memory optimization patterns using layer-wise CPU offloading, aggressive cache clearing, and state dictionary caching to enable quantization of large models on limited VRAM.

Description

AWQ quantizes models layer-by-layer, moving each transformer block to GPU for processing and back to CPU afterward. This pattern enables quantizing models that are much larger than available VRAM. The codebase uses three complementary strategies: (1) CPU-GPU toggling where individual layers are moved to CUDA for computation then back to CPU; (2) aggressive cache clearing with `torch.cuda.empty_cache()` and `gc.collect()` between quantization phases; and (3) CPU state dictionary caching where original weights are saved to CPU before grid search to enable efficient rollback without consuming GPU memory.
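The toggling strategy above can be sketched as a minimal loop (this is an illustrative sketch, not the repo's actual code; `process_blocks_layerwise` and `process_fn` are hypothetical names, and each block is assumed to fit on the GPU by itself):

```python
import gc

import torch
import torch.nn as nn

def process_blocks_layerwise(blocks, process_fn):
    """Process each block on GPU while keeping all others on CPU.

    Sketch of the AWQ pattern: only one block occupies VRAM at a time.
    """
    use_cuda = torch.cuda.is_available()
    results = []
    for block in blocks:
        if use_cuda:
            block.cuda()              # bring this block into VRAM
        results.append(process_fn(block))
        block.cpu()                   # evict it before the next block
        gc.collect()                  # drop Python references first...
        torch.cuda.empty_cache()      # ...then release cached CUDA memory
    return results

# Usage: compute each block's weight norm, one block at a time.
blocks = [nn.Linear(8, 8) for _ in range(3)]
norms = process_blocks_layerwise(blocks, lambda b: b.weight.norm().item())
```

After the loop, every block is back on CPU, so peak VRAM stays near a single block's footprint regardless of model depth.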

Usage

These patterns are essential when quantizing models on GPUs with limited VRAM (e.g., 24GB RTX 3090/4090). The layer-wise processing enables quantizing 70B+ parameter models on a single GPU. Apply these patterns when implementing custom quantization routines or when debugging OOM errors during the AWQ pipeline.
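A back-of-envelope estimate shows why this works, assuming Llama-2-70B-style dimensions (hidden size 8192, FFN size 28672, grouped-query K/V projections of width 1024; these dimensions are assumptions, not stated in the source):

```python
# Per-block parameter count for an assumed Llama-2-70B-style block
hidden, inter, kv_dim = 8192, 28672, 1024

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q/o full-width, k/v grouped-query
mlp = 3 * hidden * inter                          # gate, up, down projections
params_per_block = attn + mlp                     # ~0.86B parameters

gb_fp16 = params_per_block * 2 / 2**30            # ~1.6 GB per block in fp16
# vs. ~130 GB for all 70B parameters in fp16: processing one block at a
# time bounds peak VRAM near one block plus activation memory.
```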

The Insight (Rule of Thumb)

  • Action: Move layers to CUDA only during active computation, immediately move back to CPU afterward.
  • Pattern: `layer.cuda()` -> process -> `layer.cpu()`; call `torch.cuda.empty_cache()` and `gc.collect()` between phases.
  • State Caching: Before grid search, save `{k: v.cpu() for k, v in block.state_dict().items()}` for cheap rollback.
  • HF Cache Disable: Set `config.use_cache = False` to prevent OOM from KV cache accumulation with transformers >= 4.36.2.
  • Trade-off: CPU-GPU transfers add latency but enable processing models that would otherwise not fit in VRAM.
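The KV-cache disable from the list above is a one-line config change; a hedged sketch (the model id here is purely illustrative, and loading real weights requires network access):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Illustrative model id; causal LM configs expose `use_cache`
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.use_cache = False  # calibration forward passes don't need the KV cache

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", config=config
)
```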

Reasoning

Modern LLMs have 30-100+ transformer blocks. Keeping all blocks in GPU memory simultaneously is impossible for large models on consumer hardware. By processing one block at a time and keeping the rest on CPU, peak VRAM usage is bounded by a single block's size plus activation memory. The grid search in `auto_scale_block` tests 20 different scaling ratios per block, requiring weight rollback between iterations. Caching the original state dictionary on CPU (rather than re-loading from disk) provides fast rollback with minimal GPU memory cost. The aggressive `gc.collect()` + `torch.cuda.empty_cache()` pattern between quantization phases (scaling, clipping, actual quantization) prevents memory fragmentation from accumulating across the many iterations.
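The rollback mechanics can be illustrated with a small stand-in block (a single `nn.Linear` here, purely for demonstration; the real code caches a full transformer block's state dict):

```python
import torch
import torch.nn as nn

block = nn.Linear(4, 4)  # stand-in for a transformer block

# Cache original weights on CPU before the grid search.
# (.clone() guards this CPU-only demo; on a GPU-resident block,
# .cpu() already produces an independent copy.)
org_sd = {k: v.cpu().clone() for k, v in block.state_dict().items()}
original_weight = block.weight.detach().clone()

# Simulate one grid-search candidate: scale the weights in place
with torch.no_grad():
    block.weight.mul_(2.0)

# Rollback: restore the cached weights before trying the next ratio
block.load_state_dict(org_sd)
```

Because the cache lives on CPU, rollback costs no GPU memory and avoids re-reading weights from disk between the grid search's candidate ratios.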

# From awq/quantize/auto_clip.py:77-82
# Layer-wise CPU-GPU toggling during clipping
named_linears[name].cuda()
max_val = auto_clip_layer(
    named_linears[name].weight, input_feat[name], n_bit=w_bit, q_config=q_config
)
clip_list.append((name, max_val))
named_linears[name].cpu()

# From awq/quantize/auto_scale.py:127
# CPU state dictionary caching for grid search rollback
org_sd = {k: v.cpu() for k, v in block.state_dict().items()}

# From awq/quantize/auto_clip.py:61-62
# Aggressive cache clearing after clipping
gc.collect()
torch.cuda.empty_cache()
