Principle:Huggingface Optimum Activation Capture via Hooks
Overview
Technique for intercepting and recording intermediate activations flowing through a neural network using PyTorch forward hooks.
Description
To quantize transformer blocks sequentially, GPTQ needs the input activations to the first transformer block. Rather than running the full model, a forward pre-hook is registered on the first transformer block that captures its input arguments and then raises a ValueError to halt the forward pass. This efficiently extracts just the inputs needed for the layer-by-layer quantization loop.
The activation capture process works as follows:
- A closure (store_input_hook) is defined that captures:
  - The hidden states (taken from kwargs["hidden_states"] or, failing that, the first positional argument).
  - All additional keyword arguments (e.g., attention masks, position ids).
- The hook is registered on the first block via register_forward_pre_hook(store_input_hook, with_kwargs=True).
- Calibration data is passed through the model. Each forward pass triggers the hook, which stores the inputs and then raises a ValueError to abort further computation.
- The try/except ValueError pattern around model(**data) catches the intentional exception silently.
- After all calibration data is processed, the hook handle is removed.
All captured inputs are moved to the appropriate device for the current block being quantized.
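The steps above can be sketched as follows. This is a minimal, self-contained illustration of the capture-and-abort pattern, not Optimum's actual code; the model, layer names, and hook internals are stand-ins.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in for a transformer: two sequential 'blocks' (names illustrative)."""
    def __init__(self):
        super().__init__()
        self.block0 = nn.Linear(4, 4)
        self.block1 = nn.Linear(4, 4)

    def forward(self, hidden_states):
        return self.block1(self.block0(hidden_states))

model = ToyModel()
captured_inputs = []   # hidden states entering the first block
captured_kwargs = []   # extra kwargs (attention masks, position ids, ...)

def store_input_hook(module, args, kwargs):
    # Prefer the keyword form if present, otherwise the first positional arg.
    hidden = kwargs.get("hidden_states", args[0] if args else None)
    captured_inputs.append(hidden)
    captured_kwargs.append({k: v for k, v in kwargs.items() if k != "hidden_states"})
    raise ValueError("early exit")  # intentional: abort the rest of the forward pass

handle = model.block0.register_forward_pre_hook(store_input_hook, with_kwargs=True)

calibration_data = [torch.randn(1, 4) for _ in range(3)]
for sample in calibration_data:
    try:
        model(sample)       # hook fires on block0, then raises
    except ValueError:
        pass                # expected: computation stopped at the capture point

handle.remove()             # detach the hook once capture is complete
```

After this loop, captured_inputs holds one tensor per calibration sample, and block1 was never executed.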
Usage
Use this pattern when you need to capture intermediate activations without completing a full forward pass. It is applied internally by GPTQQuantizer.quantize_model().
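For reference, a hedged usage sketch: the class and method names come from the source, but the import path and constructor arguments shown here are assumptions, and the call requires an already-loaded model and tokenizer.

```python
from optimum.gptq import GPTQQuantizer  # import path assumed

# bits/dataset values are illustrative; quantize_model performs the
# hook-based input capture internally before its block-by-block loop.
quantizer = GPTQQuantizer(bits=4, dataset="c4")
quantized_model = quantizer.quantize_model(model, tokenizer)  # model/tokenizer assumed loaded
```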
Theoretical Basis
PyTorch's hook mechanism allows registering callbacks at any module boundary. Pre-hooks execute before the module's forward() method and can inspect or modify inputs. The with_kwargs=True parameter (available in PyTorch 2.0+) enables capturing keyword arguments in addition to positional arguments.
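A minimal sketch of a pre-hook inspecting and modifying its module's inputs (the layer and hook names are illustrative): with with_kwargs=True, a pre-hook that returns a (new_args, new_kwargs) tuple replaces the inputs the module receives.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 2)

def double_input(module, args, kwargs):
    # Returning (new_args, new_kwargs) replaces the module's inputs.
    return (args[0] * 2,), kwargs

handle = layer.register_forward_pre_hook(double_input, with_kwargs=True)
out_hooked = layer(torch.ones(1, 2))   # layer actually sees a tensor of twos
handle.remove()
out_plain = layer(torch.ones(1, 2) * 2)
# out_hooked matches out_plain: the hook doubled the input before forward()
```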
The early-termination pattern (raising an exception from within a hook) avoids unnecessary computation beyond the capture point. This is particularly important for large models where:
- Running the full forward pass would be computationally wasteful.
- Only the inputs to the first block are needed for the sequential quantization algorithm.
- Memory usage is reduced by not computing or storing activations from later blocks.
When cache_block_outputs is enabled (the default), this capture only happens once before the quantization loop. When disabled, the capture is repeated for each block, requiring a full pass through the preceding modules each time.
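The difference can be illustrated with a toy control-flow sketch. Stub callables stand in for transformer blocks, and a replay function stands in for the hook-based capture; this is not Optimum's actual code, only the shape of the loop described above.

```python
def quantize_blocks(blocks, calibration_inputs, cache_block_outputs=True):
    """blocks: list of callables; calibration_inputs: inputs to blocks[0].
    Returns how many capture passes were needed."""
    capture_count = 0

    def capture_inputs(block_index):
        # Stand-in for the hook-based capture: replay the calibration data
        # through every block preceding block_index.
        nonlocal capture_count
        capture_count += 1
        xs = list(calibration_inputs)
        for b in blocks[:block_index]:
            xs = [b(x) for x in xs]
        return xs

    if cache_block_outputs:
        xs = capture_inputs(0)                # single capture before the loop
        for block in blocks:
            # ... quantize `block` using `xs` here ...
            xs = [block(x) for x in xs]       # cached outputs feed the next block
    else:
        for i, block in enumerate(blocks):
            xs = capture_inputs(i)            # repeated pass through the prefix
            # ... quantize `block` using `xs` here ...
    return capture_count
```

With three blocks, the cached variant performs one capture pass while the uncached variant performs three, one per block.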
Related
- implemented_by → Implementation:Huggingface_Optimum_Store_Input_Hook