Implementation:Hiyouga LLaMA Factory Quantization
| Knowledge Sources | |
|---|---|
| Domains | Model Quantization, Memory Optimization |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Configures model quantization across multiple backends and methods for memory-efficient model loading and training.
Description
The quantization module handles three quantization paths, checked in priority order: (1) models already quantized by post-training quantization (PTQ), such as GPTQ, AWQ, AQLM, MXFP4, or FP8 checkpoints; (2) AutoGPTQ export quantization, which prepares a calibration dataset and quantizes the model as part of export; and (3) on-the-fly quantization at load time via BitsAndBytes (4/8-bit), HQQ (1-8 bit), or EETQ (8-bit). Each path validates compatibility with distributed training frameworks such as DeepSpeed ZeRO-3 and FSDP, and configures device maps accordingly. The internal helper _get_quantization_dataset tokenizes and samples calibration data for the AutoGPTQ export path.
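The priority order described above can be illustrated with a minimal sketch. This is not the module's code: `SketchModelArgs` and `select_quantization_path` are hypothetical names, the config is modeled as a plain dict, and the string value `"bnb"` for `quantization_method` is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchModelArgs:
    """Hypothetical stand-in for ModelArguments, limited to fields named on this page."""

    quantization_bit: Optional[int] = None
    quantization_method: str = "bnb"  # assumed string value, for illustration only
    export_quantization_bit: Optional[int] = None


def select_quantization_path(config: dict[str, Any], model_args: SketchModelArgs) -> str:
    """Return the path that would apply, checking the three paths in priority order."""
    # (1) A model that already carries a quantization_config is treated as PTQ-quantized.
    if config.get("quantization_config") is not None:
        return "ptq"
    # (2) An export bit width requests AutoGPTQ export quantization.
    if model_args.export_quantization_bit is not None:
        return "autogptq_export"
    # (3) Otherwise a load-time bit width selects on-the-fly quantization.
    if model_args.quantization_bit is not None:
        return f"on_the_fly:{model_args.quantization_method}"
    return "none"


# An existing quantization_config wins even if on-the-fly settings are also present.
chosen = select_quantization_path(
    {"quantization_config": {"quant_method": "gptq"}},
    SketchModelArgs(quantization_bit=4),
)
```

The early returns make the precedence explicit: a later path is only considered when every earlier condition is absent.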
Usage
Use this module when loading a model that requires quantization, whether the model is already quantized (PTQ), needs to be exported in quantized form (AutoGPTQ), or should be quantized on-the-fly during loading for training or inference. It is called internally by patch_config in the model patcher pipeline.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/model/model_utils/quantization.py
- Lines: 1-216
Signature
def _get_quantization_dataset(
    tokenizer: "PreTrainedTokenizer",
    model_args: "ModelArguments",
) -> list[dict[str, Any]]

def configure_quantization(
    config: "PretrainedConfig",
    tokenizer: "PreTrainedTokenizer",
    model_args: "ModelArguments",
    is_trainable: bool,
    init_kwargs: dict[str, Any],
) -> None
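The helper _get_quantization_dataset is described as tokenizing and sampling calibration data, and its return type is a list of dictionaries. A rough sketch of that idea follows, under stated assumptions: the function name `sample_calibration_windows`, the toy tokenizer, the fixed-length random windows, and the `input_ids` / `attention_mask` keys are all illustrative choices, not the helper's confirmed behavior.

```python
import random
from typing import Any, Callable


def sample_calibration_windows(
    texts: list[str],
    tokenize: Callable[[str], list[int]],
    num_samples: int,
    max_length: int,
    seed: int = 0,
) -> list[dict[str, Any]]:
    """Tokenize texts and draw random fixed-length windows as calibration samples."""
    rng = random.Random(seed)
    # Only texts with at least max_length tokens can fill a complete window.
    tokenized = [ids for ids in map(tokenize, texts) if len(ids) >= max_length]
    if not tokenized:
        raise ValueError("No text is long enough to fill a calibration window.")

    samples: list[dict[str, Any]] = []
    for _ in range(num_samples):
        ids = rng.choice(tokenized)
        start = rng.randint(0, len(ids) - max_length)  # randint is inclusive on both ends
        window = ids[start : start + max_length]
        samples.append({"input_ids": window, "attention_mask": [1] * max_length})
    return samples


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical tokenizer: maps each whitespace-separated word to its length."""
    return [len(word) for word in text.split()]


calibration = sample_calibration_windows(
    ["a bb ccc dddd eeeee", "x y"], toy_tokenize, num_samples=3, max_length=4
)
```

Seeding the generator keeps the calibration set reproducible between runs, which helps when comparing quantized exports.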
Import
from llamafactory.model.model_utils.quantization import configure_quantization
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | PretrainedConfig | Yes | The pretrained model configuration object, checked for existing quantization_config |
| tokenizer | PreTrainedTokenizer | Yes | Tokenizer used to prepare calibration data for AutoGPTQ export |
| model_args | ModelArguments | Yes | Contains quantization_bit, quantization_method, export_quantization_bit, compute_dtype, and related settings |
| is_trainable | bool | Yes | Whether the model is being loaded for training (affects device map and compatibility checks) |
| init_kwargs | dict[str, Any] | Yes | Mutable dictionary of model initialization kwargs; quantization_config, device_map, and max_memory are injected here as needed |
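The description pairs each on-the-fly backend with the bit widths it accepts: BitsAndBytes 4/8-bit, HQQ 1-8 bit, EETQ 8-bit. A small validation sketch built only from those stated ranges is shown here. The method keys `"bnb"`, `"hqq"`, and `"eetq"` and the function name are assumptions, and the underlying libraries may accept a narrower set of widths than the ranges taken at face value.

```python
# Bit widths per backend, copied from the ranges stated on this page.
SUPPORTED_BITS: dict[str, set[int]] = {
    "bnb": {4, 8},            # BitsAndBytes: 4-bit or 8-bit
    "hqq": set(range(1, 9)),  # HQQ: 1 through 8 bits, as stated in the description
    "eetq": {8},              # EETQ: 8-bit only
}


def check_quantization_bit(method: str, bit: int) -> None:
    """Raise ValueError if the (method, bit) pair falls outside the documented ranges."""
    if method not in SUPPORTED_BITS:
        raise ValueError(f"Unknown quantization method: {method!r}")
    if bit not in SUPPORTED_BITS[method]:
        allowed = sorted(SUPPORTED_BITS[method])
        raise ValueError(f"{method} does not support {bit}-bit quantization; allowed: {allowed}")


check_quantization_bit("bnb", 4)  # accepted
check_quantization_bit("hqq", 3)  # accepted
```

Failing early on an unsupported pair gives a clearer error than letting the backend reject the configuration during model loading.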
Outputs
| Name | Type | Description |
|---|---|---|
| None | None | The function modifies init_kwargs in-place, injecting quantization_config, device_map, and max_memory as needed |
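Because the function returns None, callers read its results from the dictionary they passed in. The sketch below mimics that contract with a hypothetical stand-in, `inject_quantization_sketch`; the placeholder values it writes are illustrative and are not the objects the real function produces.

```python
from typing import Any


def inject_quantization_sketch(init_kwargs: dict[str, Any], bits: int) -> None:
    """Hypothetical stand-in that follows the in-place contract and returns None."""
    # Placeholder dict; the real function is described as injecting a quantization_config.
    init_kwargs["quantization_config"] = {"bits": bits}
    init_kwargs["device_map"] = "auto"  # illustrative value only


init_kwargs: dict[str, Any] = {"trust_remote_code": False}
result = inject_quantization_sketch(init_kwargs, bits=4)

# Passing a copy discards the injected keys, so the original dict must be passed.
untouched: dict[str, Any] = {}
inject_quantization_sketch(dict(untouched), bits=4)
```

Existing keys in the dictionary are preserved; only the quantization-related entries are added.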
Usage Examples
# Typical internal usage during model loading.
# model_config, tokenizer, and model_args are assumed to have been created earlier by the loader.
from llamafactory.model.model_utils.quantization import configure_quantization

init_kwargs = {}
configure_quantization(
    config=model_config,
    tokenizer=tokenizer,
    model_args=model_args,
    is_trainable=True,
    init_kwargs=init_kwargs,
)
# If quantization applies, init_kwargs now contains quantization_config for the selected method
Related Pages
- Hiyouga_LLaMA_Factory_Model_Patcher - Calls configure_quantization as part of patch_config
- Hiyouga_LLaMA_Factory_FP8_Utils - Alternative FP8 mixed-precision training configuration