Implementation:Huggingface Transformers AutoModelForCausalLM From Pretrained For Quantization

From Leeroopedia
Knowledge Sources
Domains Model_Optimization, Quantization, Model_Loading
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete API, provided by Hugging Face Transformers, for loading a causal language model with on-the-fly weight quantization.

Description

AutoModelForCausalLM.from_pretrained() is the primary entry point for loading quantized language models. When a quantization_config parameter is supplied, the method integrates with the quantization subsystem to apply weight quantization during the model loading process rather than after.

The method is defined in the auto factory module (auto_factory.py, line 250) as a classmethod on the auto model class. It performs the following quantization-relevant steps:

  1. Resolves the model configuration, potentially extracting an existing quantization config from a pre-quantized checkpoint.
  2. Passes the quantization_config through to the underlying model class's from_pretrained().
  3. The base PreTrainedModel.from_pretrained() calls get_hf_quantizer() to instantiate the quantizer, validate the environment, and update the device map.
  4. The quantizer preprocesses the model architecture, loads and quantizes weights, and performs post-processing.
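
The ordering in steps 3 and 4 can be sketched with a toy stand-in. This is not the Transformers implementation: ToyQuantizer and toy_from_pretrained are hypothetical, the "model" is a plain dict, and the hook names are only modeled on those of the library's quantizer base class as an assumption about the current version. The sketch shows nothing more than the order in which the hooks run around weight loading.

```python
class ToyQuantizer:
    """Hypothetical stand-in that only records the order of lifecycle calls."""

    def __init__(self):
        self.calls = []

    def validate_environment(self):
        # A real backend would check installed packages and hardware here.
        self.calls.append("validate_environment")

    def update_device_map(self, device_map):
        self.calls.append("update_device_map")
        # Illustrative default only; real backends apply their own rules.
        return device_map if device_map is not None else "auto"

    def preprocess_model(self, model):
        # Stands in for swapping layers before any weights are read.
        self.calls.append("preprocess_model")
        model["prepared"] = True
        return model

    def postprocess_model(self, model):
        self.calls.append("postprocess_model")
        model["finalized"] = True
        return model


def toy_from_pretrained(quantizer, device_map=None):
    """Hypothetical loader mirroring steps 3 and 4 of the list above."""
    quantizer.validate_environment()
    device_map = quantizer.update_device_map(device_map)
    model = {"device_map": device_map, "weights": None}  # placeholder model
    model = quantizer.preprocess_model(model)
    quantizer.calls.append("load_weights")
    model["weights"] = "quantized"  # weights are quantized while loading
    return quantizer.postprocess_model(model)


quantizer = ToyQuantizer()
toy_model = toy_from_pretrained(quantizer)
print(quantizer.calls)
```

The point of the ordering is that the architecture is prepared before weights arrive, so quantization happens during loading rather than as a separate pass afterwards.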

The quantization_config is deliberately not passed to AutoConfig.from_pretrained() to avoid overwriting any existing quantization config stored in the model checkpoint. Instead, it is forwarded separately to the model loading logic where the merge logic in AutoHfQuantizer.merge_quantization_configs() resolves conflicts.
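
A minimal sketch of what such conflict resolution might look like, using plain dicts. The function merge_configs_sketch and its loading_keys parameter are hypothetical, and the precedence rule (the checkpoint's config wins, with only selected runtime options taken from the caller) is an assumption made for illustration. The actual rules live in AutoHfQuantizer.merge_quantization_configs() and may differ by backend and library version.

```python
import warnings


def merge_configs_sketch(checkpoint_config, user_config, loading_keys=()):
    """Hypothetical illustration of checkpoint-first conflict resolution.

    checkpoint_config: dict stored with the model, or None.
    user_config: dict passed by the caller, or None.
    loading_keys: hypothetical runtime-only options the caller may override.
    """
    if checkpoint_config is None:
        # Nothing stored with the model: use the caller's config as-is.
        return dict(user_config) if user_config is not None else None

    merged = dict(checkpoint_config)
    if user_config is not None:
        warnings.warn(
            "Checkpoint already has a quantization config; it takes precedence."
        )
        # Assumed rule: only runtime-only options are taken from the caller.
        for key in loading_keys:
            if key in user_config:
                merged[key] = user_config[key]
    return merged


# Illustrative values only; "backend" is a made-up runtime option.
stored = {"quant_method": "gptq", "bits": 4}
passed = {"quant_method": "gptq", "bits": 8, "backend": "example"}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    merged = merge_configs_sketch(stored, passed, loading_keys=("backend",))
print(merged)
```

Under these assumptions the stored bit width is kept and only the runtime option is taken from the caller, which is why passing the config to AutoConfig.from_pretrained() directly would be unsafe: it would overwrite the stored values before any merge could happen.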

Usage

Use this API to load any causal language model with quantization enabled. It is the standard method for both on-the-fly quantization and loading pre-quantized models.

Code Reference

Source Location

  • Repository: transformers
  • File: src/transformers/models/auto/auto_factory.py (lines 250-379)

Signature

class AutoModelForCausalLM:
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str | os.PathLike[str],
        *model_args,
        config: PreTrainedConfig | None = None,
        trust_remote_code: bool | None = None,
        quantization_config: QuantizationConfigMixin | None = None,
        device_map: str | dict | None = None,
        torch_dtype: torch.dtype | str | None = None,
        **kwargs,
    ) -> PreTrainedModel: ...

Import

from transformers import AutoModelForCausalLM

I/O Contract

Inputs

Name | Type | Required | Description
pretrained_model_name_or_path | str or os.PathLike | Yes | Model identifier (Hub repo id or local path).
quantization_config | QuantizationConfigMixin | No | Quantization configuration (e.g., BitsAndBytesConfig). If the model is pre-quantized, this can be omitted.
device_map | str or dict | No (strongly recommended: "auto") | Device placement strategy; "auto" enables automatic placement via Accelerate. Optional in the signature, but most quantization backends need it in practice.
torch_dtype | torch.dtype or str | No | Data type for non-quantized parameters (e.g., layer norms, embeddings). Commonly torch.float16 or torch.bfloat16.
config | PreTrainedConfig | No | Model configuration. If not provided, loaded automatically from the model identifier.
trust_remote_code | bool | No | Whether to trust remote code for models with custom implementations.
token | str | No | Authentication token for gated models on the Hub.
**kwargs | dict | No | Additional keyword arguments forwarded to the underlying model's from_pretrained().

Outputs

Name | Type | Description
model | PreTrainedModel | A quantized model instance ready for inference or adapter-based fine-tuning (e.g., QLoRA). Quantized layers are replaced with backend-specific modules (e.g., bnb.nn.Linear4bit).

Usage Examples

Basic 4-bit Loading

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Named bnb_config to avoid confusion with from_pretrained()'s `config` argument
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
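
As a rough back-of-the-envelope check on what 4-bit loading saves, the weight storage can be estimated from the parameter count. The helper estimate_weight_bytes is hypothetical, the 7-billion parameter count is an assumed round figure, and the estimate ignores quantization metadata, non-quantized layers, activations, and the KV cache, so real memory use will be higher.

```python
def estimate_weight_bytes(num_params, bits_per_param):
    """Approximate storage for the weights alone, ignoring quantization metadata."""
    return num_params * bits_per_param // 8


params = 7_000_000_000  # assumed round figure for a "7B" model

fp16_gb = estimate_weight_bytes(params, 16) / 1e9
int4_gb = estimate_weight_bytes(params, 4) / 1e9

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
# prints "fp16: 14.0 GB, 4-bit: 3.5 GB"
```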

QLoRA-optimized Loading

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

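bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    # dtype for non-quantized parameters, kept consistent with the compute dtype
    torch_dtype=torch.bfloat16,
)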

Loading a Pre-quantized Model

from transformers import AutoModelForCausalLM

# Pre-quantized GPTQ model -- quantization config is read from the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
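
To check ahead of time whether a local checkpoint is pre-quantized, one can look for the stored config directly. The sketch below assumes the checkpoint keeps it under a quantization_config key in config.json, which is the usual layout for Hub checkpoints. The helper read_quantization_config is hypothetical, and the file contents written here are made-up illustrative values, not those of any real model.

```python
import json
import tempfile
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` dict from config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


# Demonstrate on a throwaway directory with an illustrative config.json.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "model_type": "llama",
        "quantization_config": {"quant_method": "gptq", "bits": 4},
    }
    (Path(tmp) / "config.json").write_text(json.dumps(sample), encoding="utf-8")
    found = read_quantization_config(tmp)

print(found)
```

A None result suggests the checkpoint is not pre-quantized, in which case a quantization_config would need to be passed explicitly to quantize on the fly.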
