Implementation:Huggingface Transformers AutoModelForCausalLM From Pretrained For Quantization
| Knowledge Sources | |
|---|---|
| Domains | Model_Optimization, Quantization, Model_Loading |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete API, provided by Hugging Face Transformers, for loading a causal language model with on-the-fly weight quantization.
Description
AutoModelForCausalLM.from_pretrained() is the primary entry point for loading quantized language models. When a quantization_config parameter is supplied, the method integrates with the quantization subsystem to apply weight quantization during the model loading process rather than after.
The method is defined in the auto factory module (auto_factory.py, line 250) as a classmethod on the auto model class. It performs the following quantization-relevant steps:
- Resolves the model configuration, potentially extracting an existing quantization config from a pre-quantized checkpoint.
- Passes the quantization_config through to the underlying model class's from_pretrained().
- The base PreTrainedModel.from_pretrained() calls get_hf_quantizer() to instantiate the quantizer, validate the environment, and update the device map.
- The quantizer preprocesses the model architecture, loads and quantizes weights, and performs post-processing.
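The order of the steps in the list above can be pictured with a toy sketch. Everything in it is hypothetical: the ToyQuantizer class, the load_with_quantizer helper, and their method names only mirror the wording of the list. It is not the Transformers implementation and performs no real quantization.

```python
class ToyQuantizer:
    """Hypothetical stand-in that only records the order of the loading hooks."""

    def __init__(self):
        self.calls = []

    def validate_environment(self):
        self.calls.append("validate_environment")

    def update_device_map(self, device_map):
        self.calls.append("update_device_map")
        # Assumption for illustration only: use a CPU placeholder when none is given.
        return device_map if device_map is not None else {"": "cpu"}

    def preprocess_model(self, model):
        self.calls.append("preprocess_model")

    def load_and_quantize_weights(self, model):
        self.calls.append("load_and_quantize_weights")

    def postprocess_model(self, model):
        self.calls.append("postprocess_model")


def load_with_quantizer(quantizer, model, device_map=None):
    """Run the hooks in the order the list above describes."""
    quantizer.validate_environment()
    device_map = quantizer.update_device_map(device_map)
    quantizer.preprocess_model(model)
    quantizer.load_and_quantize_weights(model)
    quantizer.postprocess_model(model)
    return model, device_map


quantizer = ToyQuantizer()
model, device_map = load_with_quantizer(quantizer, model={})
print(quantizer.calls)
```

The point of the sketch is only the ordering: environment checks and device placement happen before the architecture is touched, and post-processing runs after the weights are in place.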
The quantization_config is deliberately not passed to AutoConfig.from_pretrained(), so that it cannot overwrite a quantization config already stored in the model checkpoint. Instead, it is forwarded separately to the model loading code, where AutoHfQuantizer.merge_quantization_configs() resolves any conflict between the two.
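To make the idea of conflict resolution concrete, here is a minimal sketch of one plausible merge policy, using plain dictionaries. It assumes the checkpoint's settings take precedence and that only a caller-chosen set of load-time keys may be overridden by the user. The function name resolve_quantization_config, the loading_keys parameter, and the dictionary keys are illustrative assumptions, not the library's API; consult the Transformers source for the actual rules.

```python
def resolve_quantization_config(checkpoint_cfg, user_cfg, loading_keys=()):
    """Hypothetical merge: checkpoint values win, except for load-time keys."""
    if checkpoint_cfg is None:
        # Nothing stored in the checkpoint: the user's config is used as-is.
        return dict(user_cfg) if user_cfg is not None else None

    merged = dict(checkpoint_cfg)
    if user_cfg is not None:
        for key in loading_keys:
            if key in user_cfg:
                # Only keys that affect loading (not the stored weights) are taken.
                merged[key] = user_cfg[key]
    return merged


# Illustrative values only.
checkpoint_cfg = {"quant_method": "gptq", "bits": 4}
user_cfg = {"quant_method": "gptq", "bits": 8, "use_exllama": False}

merged = resolve_quantization_config(
    checkpoint_cfg, user_cfg, loading_keys=("use_exllama",)
)
print(merged)
```

Under this assumed policy the stored bit width is kept, because it describes how the weights were actually quantized, while the load-time switch from the user is honored.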
Usage
Use this API to load any causal language model with quantization enabled. It is the standard method for both on-the-fly quantization and loading pre-quantized models.
Code Reference
Source Location
- Repository: transformers
- File: src/transformers/models/auto/auto_factory.py (lines 250-379)
Signature
class AutoModelForCausalLM:
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str | os.PathLike[str],
        *model_args,
        config: PreTrainedConfig | None = None,
        trust_remote_code: bool | None = None,
        quantization_config: QuantizationConfigMixin | None = None,
        device_map: str | dict | None = None,
        torch_dtype: torch.dtype | str | None = None,
        **kwargs,
    ) -> PreTrainedModel: ...
Import
from transformers import AutoModelForCausalLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pretrained_model_name_or_path | str or os.PathLike | Yes | Model identifier (Hub repo id or local path). |
| quantization_config | QuantizationConfigMixin | No | Quantization configuration (e.g., BitsAndBytesConfig). If the model is pre-quantized, this can be omitted. |
| device_map | str or dict | No (strongly recommended: "auto") | Device placement strategy. Required for most quantization backends. "auto" enables automatic placement via Accelerate. |
| torch_dtype | torch.dtype or str | No | Data type for non-quantized parameters (e.g., layer norms, embeddings). Commonly torch.float16 or torch.bfloat16. |
| config | PreTrainedConfig | No | Model configuration. If not provided, loaded automatically from the model identifier. |
| trust_remote_code | bool | No | Whether to trust remote code for models with custom implementations. |
| token | str | No | Authentication token for gated models on the Hub. |
| **kwargs | dict | No | Additional keyword arguments forwarded to the underlying model's from_pretrained(). |
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | A quantized model instance ready for inference or fine-tuning. Quantized layers are replaced with backend-specific modules (e.g., bnb.nn.Linear4bit). |
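One way to check that layers were replaced is to count submodules by class name. The sketch below is a hypothetical helper (count_module_types is not part of Transformers) that relies only on the named_modules() method that PyTorch modules expose. So that it runs without a GPU or a model download, it is demonstrated on stand-in classes; the Linear4bit and LayerNorm classes here are empty placeholders, not the real backend modules.

```python
from collections import Counter


def count_module_types(model):
    """Count submodules by class name for any object exposing named_modules()."""
    return Counter(type(module).__name__ for _, module in model.named_modules())


# Stand-in classes for the demonstration only (not the bitsandbytes/PyTorch ones).
class Linear4bit:
    pass


class LayerNorm:
    pass


class FakeModel:
    def named_modules(self):
        yield "layers.0.q_proj", Linear4bit()
        yield "layers.0.k_proj", Linear4bit()
        yield "layers.0.norm", LayerNorm()


counts = count_module_types(FakeModel())
print(counts.most_common())
```

Passing a model returned by from_pretrained() instead of FakeModel should list the backend-specific layer classes alongside the unquantized ones.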
Usage Examples
Basic 4-bit Loading
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto",
)
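As a rough guide to what 4-bit loading saves, the arithmetic below compares weight storage at 16 and 4 bits per parameter. It is a back-of-the-envelope sketch: the 7-billion parameter count is a nominal assumption, the weight_memory_gib helper is hypothetical, and the estimate ignores quantization constants, layers kept in higher precision, activations, and the KV cache.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB, ignoring all overheads."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7_000_000_000  # nominal size, assumed for illustration

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

Actual memory use after loading will be somewhat higher than the 4-bit figure, for the reasons listed above.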
QLoRA-optimized Loading
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
)
Loading a Pre-quantized Model
from transformers import AutoModelForCausalLM

# Pre-quantized GPTQ model -- quantization config is read from the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
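Before loading, it can be useful to check whether a local checkpoint already carries quantization settings. The sketch below assumes a checkpoint directory containing a config.json file that stores those settings under a quantization_config key, which is how pre-quantized checkpoints typically record them. The helper name read_checkpoint_quantization_config and the sample values are hypothetical, and a temporary directory stands in for a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def read_checkpoint_quantization_config(checkpoint_dir):
    """Return the stored quantization settings, or None if the key is absent."""
    config_path = Path(checkpoint_dir) / "config.json"
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("quantization_config")


# Demonstration on a throwaway directory with illustrative values.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "model_type": "llama",
        "quantization_config": {"quant_method": "gptq", "bits": 4},
    }
    (Path(tmp) / "config.json").write_text(json.dumps(sample), encoding="utf-8")
    found = read_checkpoint_quantization_config(tmp)

print(found)
```

A result of None suggests the checkpoint is not pre-quantized, in which case a quantization_config would need to be passed to from_pretrained() to quantize on the fly.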