Implementation:Huggingface Transformers Quantization Verification Pattern
| Knowledge Sources | |
|---|---|
| Domains | Model_Optimization, Quantization, Testing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete pattern for verifying that BitsAndBytes 4-bit quantization has been correctly applied, as demonstrated by the Hugging Face Transformers test suite.
Description
The Transformers test suite in tests/quantization/bnb/test_4bit.py provides a comprehensive set of verification patterns for BitsAndBytes 4-bit quantization. The Bnb4BitTest class loads both a full-precision (float16) and a 4-bit quantized version of the same model, then runs a series of assertions to verify correctness.
The key verification methods are:
- test_memory_footprint -- Compares model.get_memory_footprint() between the fp16 and 4-bit models, checking the ratio against an expected value (~2.1x for bloom-1b7). Also verifies that linear layer weights are instances of Params4bit.
- test_linear_are_4bit -- Iterates over all modules and asserts that torch.nn.Linear layers (except excluded ones like lm_head) have a torch.uint8 weight dtype.
- test_generate_quality -- Generates 10 tokens from a fixed prompt and checks that the output is in a known set of acceptable completions.
- test_quantization_config_json_serialization -- Verifies that the model config, including its quantization_config, serializes through to_dict(), to_diff_dict(), and to_json_string() without error.
- test_device_assignment -- Verifies that the memory footprint is preserved when a quantized model is moved between devices.
Usage
Use this pattern to build your own verification pipeline or to understand what checks are appropriate after quantizing a model.
Code Reference
Source Location
- Repository: transformers
- File: tests/quantization/bnb/test_4bit.py (lines 93-370)
Signature
# Key methods from the test class (not a public API, but a verification pattern)
class Bnb4BitTest(Base4bitTest):
    def test_memory_footprint(self): ...
    def test_linear_are_4bit(self): ...
    def test_generate_quality(self): ...
    def test_quantization_config_json_serialization(self): ...
    def test_device_assignment(self): ...
    def test_generate_quality_config(self): ...
Import
# For verification in your own code
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import bitsandbytes as bnb
import torch
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | PreTrainedModel | Yes | A model loaded with quantization enabled. |
| reference_model | PreTrainedModel | No | A full-precision version of the same model for memory comparison (optional for manual verification). |
| tokenizer | PreTrainedTokenizer | No | Tokenizer for generation quality tests. |
| input_text | str | No | Prompt text for generation quality tests. |
Outputs
| Name | Type | Description |
|---|---|---|
| verification_result | bool | Whether all verification checks pass. (In a test context, assertion failures signal problems.) |
Usage Examples
Memory Footprint Verification
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Load fp16 and 4-bit versions
model_fp16 = AutoModelForCausalLM.from_pretrained(
"bigscience/bloom-1b7", dtype=torch.float16, device_map="auto"
)
model_4bit = AutoModelForCausalLM.from_pretrained(
"bigscience/bloom-1b7",
quantization_config=BitsAndBytesConfig(load_in_4bit=True),
device_map="auto",
)
mem_fp16 = model_fp16.get_memory_footprint()
mem_4bit = model_4bit.get_memory_footprint()
ratio = mem_fp16 / mem_4bit
print(f"Memory reduction ratio: {ratio:.2f}x")
# Expected: approximately 2.1x for bloom-1b7 (includes non-quantized layers)
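A rough estimate helps explain why the ratio lands near 2x rather than 4x. The sketch below assumes that some fraction of the fp16 bytes ends up stored in 4 bits (a quarter of the fp16 size) while the rest stays in fp16, and it ignores the overhead of quantization statistics. The 70% figure is an illustrative assumption, not a measured property of bloom-1b7, and `expected_ratio` is a hypothetical helper.

```python
def expected_ratio(quantized_fraction: float) -> float:
    """Estimate the fp16-to-4-bit footprint ratio for a partly quantized model."""
    if not 0.0 <= quantized_fraction <= 1.0:
        raise ValueError("quantized_fraction must be in [0, 1]")
    kept_fp16 = 1.0 - quantized_fraction      # bytes that stay in fp16
    packed_4bit = quantized_fraction / 4.0    # 4 bits instead of 16 per value
    return 1.0 / (kept_fp16 + packed_4bit)


print(f"{expected_ratio(1.0):.2f}x")  # every byte quantized
print(f"{expected_ratio(0.7):.2f}x")  # assumed: 70% of bytes quantized
```

Under this simplified model, a measured ratio far below the estimate suggests that fewer layers were quantized than intended.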
Layer Type Inspection
import bitsandbytes as bnb
import torch

for name, module in model_4bit.named_modules():
    # Check Linear4bit first: it subclasses torch.nn.Linear
    if isinstance(module, bnb.nn.Linear4bit):
        print(f"Quantized: {name} -> {module.weight.dtype}")
    elif isinstance(module, torch.nn.Linear):
        print(f"Full precision: {name} -> {module.weight.dtype}")
Weight Dtype Verification
import torch
from transformers import T5PreTrainedModel

# Same exclusion list as the test suite; guard against the attribute being None
excluded_modules = ["lm_head"] + (T5PreTrainedModel._keep_in_fp32_modules or [])
for name, module in model_4bit.named_modules():
    if isinstance(module, torch.nn.Linear):
        if name not in excluded_modules:
            # 4-bit parameters are packed in uint8 variables
            assert module.weight.dtype == torch.uint8, (
                f"Expected uint8 for {name}, got {module.weight.dtype}"
            )
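The uint8 dtype follows from packing: two 4-bit codes fit in one byte. The sketch below illustrates only that idea in plain Python. It is not the bitsandbytes storage layout, which also involves blockwise scaling and its own ordering, and `pack_nibbles` / `unpack_nibbles` are hypothetical helper names.

```python
from typing import List


def pack_nibbles(values: List[int]) -> bytes:
    """Pack an even-length list of 4-bit integers (0..15), two per byte."""
    if len(values) % 2:
        raise ValueError("need an even number of 4-bit values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    # First value of each pair goes in the high nibble, second in the low nibble.
    return bytes((hi << 4) | lo for hi, lo in zip(values[0::2], values[1::2]))


def unpack_nibbles(packed: bytes) -> List[int]:
    """Inverse of pack_nibbles."""
    out: List[int] = []
    for byte in packed:
        out.extend((byte >> 4, byte & 0x0F))
    return out


codes = [1, 15, 0, 7]
packed = pack_nibbles(codes)
print(len(codes), "values ->", len(packed), "bytes")
```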
Generation Quality Check
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")
input_text = "Hello my name is"
encoded = tokenizer(input_text, return_tensors="pt")
output = model_4bit.generate(
input_ids=encoded["input_ids"].to(model_4bit.device),
max_new_tokens=10,
)
decoded = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"Generated: {decoded}")
# Verify output is coherent and not garbled
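To turn the manual inspection into an automated check, the decoded text can be compared against a set of accepted completions, which is the approach test_generate_quality takes. The sketch below shows that comparison in isolation. The `is_acceptable` helper is hypothetical, and the strings in `EXPECTED_OUTPUTS` are placeholders rather than real model output; they would need to be recorded from a trusted run.

```python
from typing import Iterable


def is_acceptable(decoded: str, expected_outputs: Iterable[str]) -> bool:
    """True if the decoded text exactly matches one of the accepted completions."""
    return decoded.strip() in {text.strip() for text in expected_outputs}


# Hypothetical placeholders: replace with completions recorded from a trusted run.
EXPECTED_OUTPUTS = {
    "Hello my name is <completion recorded from run A>",
    "Hello my name is <completion recorded from run B>",
}

print(is_acceptable("Hello my name is <completion recorded from run A>\n", EXPECTED_OUTPUTS))
print(is_acceptable("Hello my name is ???", EXPECTED_OUTPUTS))
```

Keeping several accepted completions allows for small numerical differences between hardware and library versions.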
Config Serialization Check
config = model_4bit.config
assert hasattr(config, "quantization_config"), "Missing quantization_config"
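# Exercise each serialization path; a failure raises an exception
config_dict = config.to_dict()
config_json = config.to_json_string()
diff_dict = config.to_diff_dict()
assert "quantization_config" in config_dict, "quantization_config missing from to_dict()"
print("Quantization config present: True")
print("Serialization: OK")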
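The calls above confirm that serialization does not raise, but they do not inspect the result. A stricter check, sketched below, parses the JSON string and confirms that the 4-bit setting is still present. It assumes the serialized config nests the settings under a `quantization_config` key with a `load_in_4bit` field, mirroring the `BitsAndBytesConfig(load_in_4bit=True)` argument used earlier. The `quantization_in_json` helper is hypothetical, and the sample string is a hand-written stand-in for `config.to_json_string()`.

```python
import json


def quantization_in_json(config_json: str) -> bool:
    """Check that serialized config JSON still declares 4-bit loading."""
    data = json.loads(config_json)
    quant = data.get("quantization_config")
    return isinstance(quant, dict) and quant.get("load_in_4bit") is True


# Hand-written stand-in; a real serialized config has many more keys.
sample = json.dumps(
    {"model_type": "bloom", "quantization_config": {"load_in_4bit": True}}
)
print(quantization_in_json(sample))
print(quantization_in_json(json.dumps({"model_type": "bloom"})))
```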