Implementation:Huggingface Transformers AutoModelForCausalLM From Pretrained For Quantization

Knowledge Sources	Transformers AutoModel Quantization
Domains	Model_Optimization, Quantization, Model_Loading
Last Updated	2026-02-13 00:00 GMT

Overview

Concrete API, provided by Hugging Face Transformers, for loading a causal language model with on-the-fly weight quantization.

Description

AutoModelForCausalLM.from_pretrained() is the primary entry point for loading quantized language models. When a quantization_config parameter is supplied, the method integrates with the quantization subsystem to apply weight quantization during the model loading process rather than after.

The method is defined in the auto factory module (auto_factory.py, line 250) as a classmethod on the shared auto-model base class (_BaseAutoModelClass), which AutoModelForCausalLM inherits. It performs the following quantization-relevant steps:

Resolves the model configuration, potentially extracting an existing quantization config from a pre-quantized checkpoint.
Passes the quantization_config through to the underlying model class's from_pretrained().
The base PreTrainedModel.from_pretrained() calls get_hf_quantizer() to instantiate the quantizer, validate the environment, and update the device map.
The quantizer preprocesses the model architecture, loads and quantizes weights, and performs post-processing.
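
The ordering of these steps can be sketched with a toy stand-in. This is not the Transformers implementation: `ToyQuantizer` and `load_with_quantizer` are hypothetical names, the model is a plain dict, and the hook names only mirror the stages described above (environment validation, device-map update, preprocessing, weight loading, post-processing).

```python
from dataclasses import dataclass, field


@dataclass
class ToyQuantizer:
    """Hypothetical stand-in that records the order in which hooks run."""

    calls: list = field(default_factory=list)

    def validate_environment(self):
        self.calls.append("validate_environment")

    def update_device_map(self, device_map):
        self.calls.append("update_device_map")
        # Assumption for this sketch only: fall back to "auto" when unset.
        return device_map if device_map is not None else "auto"

    def preprocess_model(self, model):
        # A real backend would swap linear layers for quantized
        # placeholders here, before any weights are loaded.
        self.calls.append("preprocess_model")

    def postprocess_model(self, model):
        self.calls.append("postprocess_model")


def load_with_quantizer(quantizer, device_map=None):
    """Run the toy hooks in the order described in the steps above."""
    quantizer.validate_environment()
    device_map = quantizer.update_device_map(device_map)
    model = {"device_map": device_map, "weights": None}
    quantizer.preprocess_model(model)
    quantizer.calls.append("load_and_quantize_weights")
    model["weights"] = "quantized"
    quantizer.postprocess_model(model)
    return model


quantizer = ToyQuantizer()
toy_model = load_with_quantizer(quantizer)
print(quantizer.calls)
```

The point of the sketch is the sequencing: the architecture is prepared before weights arrive, so quantization happens as part of loading rather than as a second pass over a full-precision model.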

The quantization_config is deliberately not passed to AutoConfig.from_pretrained(), so that it cannot overwrite a quantization config already stored in the model checkpoint. Instead, it is forwarded separately to the model-loading code, where AutoHfQuantizer.merge_quantization_configs() resolves conflicts between the two.
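
A simplified sketch of that precedence follows. It assumes that the checkpoint's stored config takes priority and that only a small set of load-time options from the user-supplied config is carried over. The function name, the dict representation, the `loading_keys` parameter, and the `backend` key are all hypothetical; the real merge logic operates on config objects and differs by backend.

```python
import warnings


def merge_quantization_configs_sketch(checkpoint_cfg, user_cfg, loading_keys=()):
    """Toy merge: keep the checkpoint's config, optionally override load-time keys.

    Both configs are plain dicts here (an assumption for illustration).
    """
    if checkpoint_cfg is None:
        # Not pre-quantized: the user-supplied config is used as-is.
        return user_cfg
    if user_cfg is None:
        # Pre-quantized and nothing passed: use what the checkpoint stores.
        return dict(checkpoint_cfg)

    warnings.warn(
        "Checkpoint already stores a quantization config; it takes priority "
        "over the one passed to from_pretrained() in this sketch."
    )
    merged = dict(checkpoint_cfg)
    for key in loading_keys:
        if key in user_cfg:
            merged[key] = user_cfg[key]
    return merged


stored = {"quant_method": "gptq", "bits": 4, "backend": "default"}
passed = {"quant_method": "gptq", "bits": 8, "backend": "custom"}

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    merged = merge_quantization_configs_sketch(stored, passed, loading_keys=("backend",))

print(merged)
```

In this toy version the stored `bits` value survives while the load-time `backend` option is taken from the caller, which is the kind of conflict the separate forwarding path exists to handle.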

Usage

Use this API to load any causal language model with quantization enabled. It is the standard method for both on-the-fly quantization and loading pre-quantized models.

Code Reference

Source Location

Repository: transformers
File: src/transformers/models/auto/auto_factory.py (lines 250-379)

Signature

class AutoModelForCausalLM:
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str | os.PathLike[str],
        *model_args,
        config: PreTrainedConfig | None = None,
        trust_remote_code: bool | None = None,
        quantization_config: QuantizationConfigMixin | None = None,
        device_map: str | dict | None = None,
        torch_dtype: torch.dtype | str | None = None,
        **kwargs,
    ) -> PreTrainedModel: ...

Import

from transformers import AutoModelForCausalLM

I/O Contract

Inputs

Name	Type	Required	Description
pretrained_model_name_or_path	`str` or `os.PathLike`	Yes	Model identifier (Hub repo id or local path).
quantization_config	`QuantizationConfigMixin`	No	Quantization configuration (e.g., `BitsAndBytesConfig`). If the model is pre-quantized, this can be omitted.
device_map	`str` or `dict`	No (strongly recommended: `"auto"`)	Device placement strategy. Not strictly required, but most quantization backends need the weights placed on an accelerator at load time. `"auto"` enables automatic placement via Accelerate.
torch_dtype	`torch.dtype` or `str`	No	Data type for non-quantized parameters (e.g., layer norms, embeddings). Commonly `torch.float16` or `torch.bfloat16`.
config	`PreTrainedConfig`	No	Model configuration. If not provided, loaded automatically from the model identifier.
trust_remote_code	`bool`	No	Whether to trust remote code for models with custom implementations.
token	`str`	No	Authentication token for gated models on the Hub.
**kwargs	`dict`	No	Additional keyword arguments forwarded to the underlying model's `from_pretrained()`.

Outputs

Name	Type	Description
model	`PreTrainedModel`	A quantized model instance ready for inference or adapter-based fine-tuning (e.g., QLoRA). Quantized layers are replaced with backend-specific modules (e.g., `bnb.nn.Linear4bit`).
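
One way to confirm that layers were actually replaced is to tally module class names after loading. The helper below is a sketch that relies only on `named_modules()`, which `torch.nn.Module` provides. `count_module_types` and the stand-in classes are hypothetical and exist so the snippet runs without a GPU or a model download.

```python
from collections import Counter


def count_module_types(model):
    """Count submodules by class name using the named_modules() iterator."""
    return Counter(type(module).__name__ for _, module in model.named_modules())


# Stand-ins so the sketch runs without torch; with a real model you would
# pass the object returned by from_pretrained() instead.
class Linear4bit:
    pass


class LayerNorm:
    pass


class FakeModel:
    def named_modules(self):
        yield "", self
        yield "layers.0.q_proj", Linear4bit()
        yield "layers.0.k_proj", Linear4bit()
        yield "layers.0.norm", LayerNorm()


counts = count_module_types(FakeModel())
print(counts["Linear4bit"], counts["LayerNorm"])
```

With a bitsandbytes 4-bit model, a non-zero count for the backend's linear class would indicate that replacement took place; which layers stay in higher precision depends on the backend and the model.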

Usage Examples

Basic 4-bit Loading

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto",
)
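
For a rough sense of what 4-bit loading saves, weight storage scales with bits per parameter. The estimate below is back-of-the-envelope arithmetic only: it assumes every parameter is quantized and ignores quantization constants, layers kept in higher precision, activations, and the KV cache. `estimate_weight_gb` is a hypothetical helper, and 7e9 is an approximate parameter count for a 7B model.

```python
def estimate_weight_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # approximate; the exact count depends on the model
fp16_gb = estimate_weight_gb(num_params, 16)
int4_gb = estimate_weight_gb(num_params, 4)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Actual memory use will be higher than the 4-bit figure, since some layers are typically left unquantized and runtime buffers are not counted here.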

QLoRA-optimized Loading

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # keep non-quantized params in the same dtype as bnb_4bit_compute_dtype
)

Loading a Pre-quantized Model

from transformers import AutoModelForCausalLM

# Pre-quantized GPTQ model -- quantization config is read from the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
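
Pre-quantized checkpoints record their settings under a `quantization_config` key in `config.json`, which is what the loader picks up when no config is passed. The sketch below reads a local `config.json` and returns the stored settings. `read_stored_quantization` is a hypothetical helper, and the temporary directory with illustrative values stands in for a downloaded checkpoint.

```python
import json
import tempfile
from pathlib import Path


def read_stored_quantization(checkpoint_dir):
    """Return the checkpoint's quantization_config dict, or None if absent."""
    config_path = Path(checkpoint_dir) / "config.json"
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("quantization_config")


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a pre-quantized checkpoint; values are illustrative only.
    sample = {
        "model_type": "llama",
        "quantization_config": {"quant_method": "gptq", "bits": 4},
    }
    (Path(tmp) / "config.json").write_text(json.dumps(sample), encoding="utf-8")
    stored_cfg = read_stored_quantization(tmp)

print(stored_cfg)
```

A `None` result would suggest the checkpoint is not pre-quantized, in which case a `quantization_config` has to be supplied to get quantized loading.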

Related Pages

Implements Principle

Principle:Huggingface_Transformers_Quantized_Model_Loading

Requires Environment
