Implementation: AllenAI Open Instruct AutoModelForCausalLM.from_pretrained
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Deep Learning, Natural Language Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for loading a pre-trained causal language model with optional quantization and LoRA configuration, as used in the Open Instruct fine-tuning pipeline.
Description
This page documents the model-loading block in finetune.py, which wraps Hugging Face's AutoModelForCausalLM.from_pretrained() with Open Instruct's specific configuration patterns. The loading logic supports three modes:
- QLoRA mode: Loads the model with 4-bit NF4 quantization via bitsandbytes, then wraps it with LoRA adapters using PEFT.
- Liger Kernel mode: Uses AutoLigerKernelForCausalLM for fused linear cross-entropy optimization.
- Standard mode: Loads the model in bfloat16 with optional flash attention.
After loading, the code resizes token embeddings if the tokenizer vocabulary is larger than the model's embedding layer (padded to multiples of 8 for tensor core efficiency). If LoRA is enabled, the model is wrapped with get_peft_model() using a LoraConfig targeting the attention and MLP projection layers.
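The padding-to-a-multiple-of-8 step described above can be sketched as plain arithmetic (a minimal illustration; in transformers this is handled by `model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)`, and the example vocabulary size below is assumed, not taken from the source):

```python
import math

def padded_embedding_size(vocab_size: int, multiple: int = 8) -> int:
    # Round the embedding count up to the nearest multiple of 8
    # so the embedding matrix dimensions stay tensor-core friendly.
    return math.ceil(vocab_size / multiple) * multiple

# e.g. a tokenizer whose vocabulary grew to 128,258 tokens
print(padded_embedding_size(128_258))  # → 128264
print(padded_embedding_size(32_000))   # already a multiple of 8 → 32000
```

The resize is only triggered when the tokenizer vocabulary exceeds the model's current embedding count, so checkpoints whose embeddings already fit are left untouched.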
Usage
This loading pattern is invoked automatically by the main() function in finetune.py. External users configure it through the FlatArguments dataclass fields: model_name_or_path, use_flash_attn, use_lora, use_qlora, lora_rank, lora_alpha, lora_dropout.
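The configuration surface can be pictured as a small dataclass (a hedged sketch: field names come from this page and defaults mirror the I/O contract, but this is not the authoritative FlatArguments definition from open_instruct):

```python
from dataclasses import dataclass

@dataclass
class FlatArguments:
    # Illustrative subset of the fields listed above; defaults per the I/O contract.
    model_name_or_path: str = "allenai/Llama-3.1-Tulu-3-8B"
    use_flash_attn: bool = True
    use_lora: bool = False
    use_qlora: bool = False  # implies use_lora=True downstream
    lora_rank: int = 64
    lora_alpha: float = 16.0
    lora_dropout: float = 0.1

args = FlatArguments(use_qlora=True)
print(args.lora_rank)  # → 64
```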
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/finetune.py
- Lines: L494-599 (model loading and LoRA configuration block)
Signature
This is not a standalone function but a code block within main(). The key calls are:
# Standard loading
model = AutoModelForCausalLM.from_pretrained(
args.model_name_or_path,
revision=args.model_revision,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
trust_remote_code=tc.trust_remote_code,
low_cpu_mem_usage=args.low_cpu_mem_usage,
dtype=torch.bfloat16,
attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
)
# QLoRA loading
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
args.model_name_or_path,
quantization_config=bnb_config,
device_map={"": device_index},
...
)
# LoRA wrapping
peft_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
inference_mode=False,
r=args.lora_rank,
lora_alpha=args.lora_alpha,
lora_dropout=args.lora_dropout,
target_modules=["q_proj", "o_proj", "v_proj", "k_proj",
"gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)
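The branching among the three modes can be summarized as a small helper (a simplified sketch of the control flow implied above; the exact precedence of the flags in finetune.py may differ):

```python
def select_loading_mode(use_qlora: bool, use_liger_kernel: bool) -> str:
    # Hypothetical helper: QLoRA and Liger Kernel are treated as
    # mutually exclusive here, with the standard bfloat16 path as fallback.
    if use_qlora:
        return "qlora"
    if use_liger_kernel:
        return "liger"
    return "standard"

print(select_loading_mode(use_qlora=True, use_liger_kernel=False))  # → qlora
```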
Import
from transformers import AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | Yes | HuggingFace model ID or local path to the pre-trained model. |
| model_revision | str or None | No | Specific model version (branch, tag, or commit hash). |
| use_flash_attn | bool | No | Whether to use Flash Attention 2. Defaults to True. |
| use_lora | bool | No | Whether to apply LoRA adapters. Defaults to False. |
| use_qlora | bool | No | Whether to use 4-bit quantization with LoRA. Defaults to False. Implies use_lora=True. |
| lora_rank | int | No | Rank of the LoRA decomposition. Defaults to 64. |
| lora_alpha | float | No | LoRA scaling factor. Defaults to 16. |
| lora_dropout | float | No | Dropout rate for LoRA layers. Defaults to 0.1. |
| low_cpu_mem_usage | bool | No | Whether to use low CPU memory mode for loading. Defaults to False. |
| trust_remote_code | bool | No | Whether to trust remote code in model definition. Defaults to False. |
| use_liger_kernel | bool | No | Whether to use LigerKernel fused operations. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
| model | AutoModelForCausalLM (or PeftModel) | The loaded model ready for training. If LoRA is enabled, this is a PeftModel wrapping the base model. |
Usage Examples
Basic Usage
from transformers import AutoModelForCausalLM
import torch
# Standard full fine-tuning
model = AutoModelForCausalLM.from_pretrained(
"allenai/Llama-3.1-Tulu-3-8B",
dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
QLoRA Loading
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
"allenai/Llama-3.1-Tulu-3-8B",
quantization_config=bnb_config,
device_map={"": 0},
attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
peft_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=64,
lora_alpha=16,
lora_dropout=0.1,
target_modules=["q_proj", "o_proj", "v_proj", "k_proj",
"gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)
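For intuition on why QLoRA trains so few parameters, each targeted projection gets a rank-r update factored as B @ A. A rough per-layer count (an illustrative sketch; the 4096-dimensional projections below are assumed for an 8B-class Llama, not taken from the source):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factors the weight update as B @ A,
    # where A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r

# e.g. a single 4096 -> 4096 q_proj with r=64
print(lora_param_count(4096, 4096, 64))  # → 524288
```

Compared to the 16.8M parameters of the frozen 4096x4096 base projection, the adapter adds about 3% extra trainable weight per layer, which is why the base model can stay 4-bit quantized.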
Dependencies
- transformers -- provides AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
- peft -- provides LoraConfig, get_peft_model, prepare_model_for_kbit_training
- bitsandbytes -- required for QLoRA 4-bit quantization
- flash-attn -- required when use_flash_attn=True
- torch -- PyTorch for tensor operations and model execution