Principle:Huggingface Optimum Activation Capture via Hooks
Overview
Technique for intercepting and recording intermediate activations flowing through a neural network using PyTorch forward hooks.
Description
To quantize transformer blocks sequentially, GPTQ needs the input activations to the first transformer block. Rather than running the full model, a forward pre-hook is registered on the first transformer block that captures its input arguments and then raises a ValueError to halt the forward pass. This efficiently extracts just the inputs needed for the layer-by-layer quantization loop.
The activation capture process works as follows:
- A closure (store_input_hook) is defined that captures:
  - The hidden states (taken from kwargs["hidden_states"] or, failing that, the first positional argument).
  - All additional keyword arguments (e.g., attention masks, position ids).
- The hook is registered on the first block via register_forward_pre_hook(store_input_hook, with_kwargs=True).
- Calibration data is passed through the model. Each forward pass triggers the hook, which stores the inputs and then raises a ValueError to abort further computation.
- The try/except ValueError pattern around model(**data) catches the intentional exception silently.
- After all calibration data is processed, the hook handle is removed.
All captured inputs are moved to the appropriate device for the current block being quantized.
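The steps above can be sketched as follows. This is a minimal, self-contained illustration of the capture-and-abort pattern, not Optimum's actual code; the model, layer names, and hook internals are stand-ins.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in for a transformer: two sequential 'blocks' (names illustrative)."""
    def __init__(self):
        super().__init__()
        self.block0 = nn.Linear(4, 4)
        self.block1 = nn.Linear(4, 4)

    def forward(self, hidden_states):
        return self.block1(self.block0(hidden_states))

model = ToyModel()
captured_inputs = []   # hidden states entering the first block
captured_kwargs = []   # extra kwargs (attention masks, position ids, ...)

def store_input_hook(module, args, kwargs):
    # Prefer the keyword form if present, otherwise the first positional arg.
    hidden = kwargs.get("hidden_states", args[0] if args else None)
    captured_inputs.append(hidden)
    captured_kwargs.append({k: v for k, v in kwargs.items() if k != "hidden_states"})
    raise ValueError("early exit")  # intentional: abort the rest of the forward pass

handle = model.block0.register_forward_pre_hook(store_input_hook, with_kwargs=True)

calibration_data = [torch.randn(1, 4) for _ in range(3)]
for sample in calibration_data:
    try:
        model(sample)       # hook fires on block0, then raises
    except ValueError:
        pass                # expected: computation stopped at the capture point

handle.remove()             # detach the hook once capture is complete
```

After this loop, captured_inputs holds one tensor per calibration sample, and block1 was never executed.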
Usage
Use this pattern when you need to capture intermediate activations without completing a full forward pass. It is applied internally by GPTQQuantizer.quantize_model().
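For reference, a hedged usage sketch: the class and method names come from the source, but the import path and constructor arguments shown here are assumptions, and the call requires an already-loaded model and tokenizer.

```python
from optimum.gptq import GPTQQuantizer  # import path assumed

# bits/dataset values are illustrative; quantize_model performs the
# hook-based input capture internally before its block-by-block loop.
quantizer = GPTQQuantizer(bits=4, dataset="c4")
quantized_model = quantizer.quantize_model(model, tokenizer)  # model/tokenizer assumed loaded
```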
Theoretical Basis
PyTorch's hook mechanism allows registering callbacks at any module boundary. Pre-hooks execute before the module's forward() method and can inspect or modify inputs. The with_kwargs=True parameter (available in PyTorch 2.0+) enables capturing keyword arguments in addition to positional arguments.
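A minimal sketch of a pre-hook inspecting and modifying its module's inputs (the layer and hook names are illustrative): with with_kwargs=True, a pre-hook that returns a (new_args, new_kwargs) tuple replaces the inputs the module receives.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 2)

def double_input(module, args, kwargs):
    # Returning (new_args, new_kwargs) replaces the module's inputs.
    return (args[0] * 2,), kwargs

handle = layer.register_forward_pre_hook(double_input, with_kwargs=True)
out_hooked = layer(torch.ones(1, 2))   # layer actually sees a tensor of twos
handle.remove()
out_plain = layer(torch.ones(1, 2) * 2)
# out_hooked matches out_plain: the hook doubled the input before forward()
```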
The early-termination pattern (raising an exception from within a hook) avoids unnecessary computation beyond the capture point. This is particularly important for large models where:
- Running the full forward pass would be computationally wasteful.
- Only the inputs to the first block are needed for the sequential quantization algorithm.
- Memory usage is reduced by not computing or storing activations from later blocks.
When cache_block_outputs is enabled (the default), this capture only happens once before the quantization loop. When disabled, the capture is repeated for each block, requiring a full pass through the preceding modules each time.
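The difference can be illustrated with a toy control-flow sketch. Stub callables stand in for transformer blocks, and a replay function stands in for the hook-based capture; this is not Optimum's actual code, only the shape of the loop described above.

```python
def quantize_blocks(blocks, calibration_inputs, cache_block_outputs=True):
    """blocks: list of callables; calibration_inputs: inputs to blocks[0].
    Returns how many capture passes were needed."""
    capture_count = 0

    def capture_inputs(block_index):
        # Stand-in for the hook-based capture: replay the calibration data
        # through every block preceding block_index.
        nonlocal capture_count
        capture_count += 1
        xs = list(calibration_inputs)
        for b in blocks[:block_index]:
            xs = [b(x) for x in xs]
        return xs

    if cache_block_outputs:
        xs = capture_inputs(0)                # single capture before the loop
        for block in blocks:
            # ... quantize `block` using `xs` here ...
            xs = [block(x) for x in xs]       # cached outputs feed the next block
    else:
        for i, block in enumerate(blocks):
            xs = capture_inputs(i)            # repeated pass through the prefix
            # ... quantize `block` using `xs` here ...
    return capture_count
```

With three blocks, the cached variant performs one capture pass while the uncached variant performs three, one per block.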
Related
- implemented_by → Implementation:Huggingface_Optimum_Store_Input_Hook