Implementation:Hiyouga LLaMA Factory Quantization

From Leeroopedia


Knowledge Sources
Domains: Model Quantization, Memory Optimization
Last Updated: 2026-02-06 19:00 GMT

Overview

Configures model quantization across multiple backends and methods for memory-efficient model loading and training.

Description

The quantization module handles three quantization paths in a fixed priority order: (1) models that already carry post-training quantization (PTQ), covering GPTQ, AWQ, AQLM, MXFP4, and FP8; (2) AutoGPTQ export quantization, which prepares a calibration dataset and quantizes the model on the fly; and (3) on-the-fly quantization at load time via BitsAndBytes (4-bit or 8-bit), HQQ (1 to 8 bits), or EETQ (8-bit). Each path checks compatibility with distributed training frameworks such as DeepSpeed ZeRO-3 and FSDP and configures the device map accordingly. The internal helper _get_quantization_dataset tokenizes and samples the calibration data used for AutoGPTQ export.
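The priority order described above can be illustrated with a minimal sketch. The function name select_quantization_path and its simplified boolean and integer parameters are hypothetical stand-ins for the checks on config and model_args; this is not the module's actual code.

```python
from typing import Optional


def select_quantization_path(
    has_quantization_config: bool,
    export_quantization_bit: Optional[int],
    quantization_bit: Optional[int],
) -> Optional[str]:
    """Return the name of the first matching path, checked in priority order.

    All names here are hypothetical; only the ordering mirrors the description.
    """
    if has_quantization_config:
        # 1) The checkpoint already carries a quantization_config (PTQ).
        return "ptq"
    if export_quantization_bit is not None:
        # 2) AutoGPTQ export, which needs calibration data.
        return "autogptq_export"
    if quantization_bit is not None:
        # 3) On-the-fly quantization (BitsAndBytes, HQQ, or EETQ).
        return "on_the_fly"
    return None  # No quantization requested.


# An already-quantized checkpoint wins even if other settings are present.
print(select_quantization_path(True, 4, 4))
print(select_quantization_path(False, 4, 8))
print(select_quantization_path(False, None, 8))
```

Because the checks are ordered, a model that already ships a quantization_config takes the PTQ path in this sketch regardless of the other two settings.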

Usage

Use this module when loading a model that requires quantization, whether the model is already quantized (PTQ), needs to be exported in quantized form (AutoGPTQ), or should be quantized on-the-fly during loading for training or inference. It is called internally by patch_config in the model patcher pipeline.

Code Reference

Source Location

Signature

def _get_quantization_dataset(
    tokenizer: "PreTrainedTokenizer",
    model_args: "ModelArguments",
) -> list[dict[str, Any]]

def configure_quantization(
    config: "PretrainedConfig",
    tokenizer: "PreTrainedTokenizer",
    model_args: "ModelArguments",
    is_trainable: bool,
    init_kwargs: dict[str, Any],
) -> None
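To make the return type of _get_quantization_dataset more concrete, the following sketch shows one plausible way to sample fixed-length calibration windows from tokenized text. It is an illustration under assumptions: toy_tokenize is a hypothetical stand-in for a real tokenizer, the parameter names nsamples, maxlen, and max_tries are invented for this example, and plain Python lists are used in place of tensors.

```python
import random
from typing import Any


def toy_tokenize(text: str) -> list[int]:
    # Hypothetical stand-in for a real tokenizer: one integer id per word.
    return [sum(map(ord, word)) % 1000 for word in text.split()]


def sample_calibration_windows(
    texts: list[str],
    nsamples: int,
    maxlen: int,
    seed: int = 0,
    max_tries: int = 100,
) -> list[dict[str, Any]]:
    """Draw nsamples random windows of exactly maxlen tokens each."""
    rng = random.Random(seed)
    samples: list[dict[str, Any]] = []
    for _ in range(nsamples):
        for _ in range(max_tries):
            ids = toy_tokenize(rng.choice(texts))
            if len(ids) >= maxlen:
                break  # Found a document long enough to cut a window from.
        else:
            raise ValueError("No document has at least maxlen tokens; consider lowering maxlen.")
        # Pick a random start so that the window fits inside the document.
        start = rng.randint(0, len(ids) - maxlen)
        window = ids[start : start + maxlen]
        samples.append({"input_ids": window, "attention_mask": [1] * maxlen})
    return samples


docs = ["one two three four five six seven eight", "too short"]
calibration = sample_calibration_windows(docs, nsamples=3, maxlen=4)
print(len(calibration), len(calibration[0]["input_ids"]))
```

Returning plain lists rather than tensors keeps each sample serializable, which matches the list[dict[str, Any]] return annotation in the signature above.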

Import

from llamafactory.model.model_utils.quantization import configure_quantization

I/O Contract

Inputs

Name | Type | Required | Description
config | PretrainedConfig | Yes | The pretrained model configuration object, checked for an existing quantization_config
tokenizer | PreTrainedTokenizer | Yes | Tokenizer used to prepare calibration data for AutoGPTQ export
model_args | ModelArguments | Yes | Contains quantization_bit, quantization_method, export_quantization_bit, compute_dtype, and related settings
is_trainable | bool | Yes | Whether the model is being loaded for training (affects device map and compatibility checks)
init_kwargs | dict[str, Any] | Yes | Mutable dictionary of model initialization kwargs; quantization_config and device_map are injected here

Outputs

Name | Type | Description
None | None | The function modifies init_kwargs in place, injecting quantization_config, device_map, and max_memory as needed
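The in-place contract can be sketched as follows. This is a simplified illustration, not the module's implementation: inject_on_the_fly_settings, its distributed flag, and the plain dict used as a quantization config are all hypothetical. The only real convention relied on is that a device_map of the form {"": device} places the whole model on one device.

```python
from typing import Any


def inject_on_the_fly_settings(
    init_kwargs: dict[str, Any],
    quantization_bit: int,
    distributed: bool,
    device: str = "cuda:0",
) -> None:
    """Mutate init_kwargs in place; nothing is returned (hypothetical helper)."""
    if quantization_bit not in (4, 8):
        raise ValueError("This sketch only models 4-bit and 8-bit quantization.")
    # A plain dict stands in for a real quantization config object.
    init_kwargs["quantization_config"] = {"bits": quantization_bit}
    if not distributed:
        # Without a sharding framework, pin the whole model to one device.
        init_kwargs["device_map"] = {"": device}


init_kwargs: dict[str, Any] = {"low_cpu_mem_usage": True}
result = inject_on_the_fly_settings(init_kwargs, quantization_bit=4, distributed=False)
print(result)
print(sorted(init_kwargs))
```

The caller keeps a reference to the same dictionary, so keys set before the call survive and the injected keys can be forwarded to the model loader afterwards.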

Usage Examples

# Typical internal usage during model loading.
# model_config, tokenizer, and model_args are assumed to have been created
# earlier in the loading pipeline.
from llamafactory.model.model_utils.quantization import configure_quantization

init_kwargs = {}
configure_quantization(
    config=model_config,
    tokenizer=tokenizer,
    model_args=model_args,
    is_trainable=True,
    init_kwargs=init_kwargs,
)
# Depending on the selected path, init_kwargs may now contain
# quantization_config, device_map, and max_memory.
