Implementation:Allenai Open instruct AutoModelForCausalLM From Pretrained

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Deep Learning, Natural Language Processing
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for loading a pre-trained causal language model with optional quantization and LoRA configuration, as used in the Open Instruct fine-tuning pipeline.

Description

This page documents the model-loading block in finetune.py, which wraps Hugging Face's AutoModelForCausalLM.from_pretrained() with Open Instruct's specific configuration patterns. The loading logic supports three modes:

  1. QLoRA mode: Loads the model with 4-bit NF4 quantization via bitsandbytes, then wraps it with LoRA adapters using PEFT.
  2. Liger Kernel mode: Uses AutoLigerKernelForCausalLM for fused linear cross-entropy optimization.
  3. Standard mode: Loads the model in bfloat16 with optional flash attention.
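The choice between the three modes can be sketched as a small dispatch helper. This is a sketch only: the precedence of QLoRA over Liger Kernel is assumed from the ordering of the mode list, not quoted from finetune.py.

```python
def select_loading_mode(use_qlora: bool, use_liger_kernel: bool) -> str:
    """Pick which loading path to take; QLoRA is assumed to take precedence."""
    if use_qlora:
        return "qlora"      # 4-bit NF4 via bitsandbytes, then LoRA adapters
    if use_liger_kernel:
        return "liger"      # AutoLigerKernelForCausalLM with fused kernels
    return "standard"       # plain bfloat16 load, optional flash attention
```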

After loading, the code resizes token embeddings if the tokenizer vocabulary is larger than the model's embedding layer (padded to multiples of 8 for tensor core efficiency). If LoRA is enabled, the model is wrapped with get_peft_model() using a LoraConfig targeting the attention and MLP projection layers.
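The pad-to-multiple-of-8 arithmetic is easy to verify in isolation. A minimal sketch (the actual Hugging Face call, model.resize_token_embeddings, appears only in the comment):

```python
import math

def padded_vocab_size(tokenizer_len: int, multiple: int = 8) -> int:
    """Round the vocabulary size up to the nearest multiple of 8 so the
    embedding matrix dimensions suit tensor cores."""
    return math.ceil(tokenizer_len / multiple) * multiple

# In the pipeline, this value would feed into:
# model.resize_token_embeddings(padded_vocab_size(len(tokenizer)))
```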

Usage

This loading pattern is invoked automatically by the main() function in finetune.py. External users configure it through the FlatArguments dataclass fields: model_name_or_path, use_flash_attn, use_lora, use_qlora, lora_rank, lora_alpha, lora_dropout.
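An abridged sketch of the relevant FlatArguments fields, with defaults taken from the I/O contract on this page. The shape shown here is hypothetical; the real dataclass defines many more fields, and model_name_or_path has no default in practice (it is required).

```python
from dataclasses import dataclass

@dataclass
class FlatArguments:  # abridged sketch; not the full Open Instruct dataclass
    model_name_or_path: str = "allenai/Llama-3.1-Tulu-3-8B"  # example value
    use_flash_attn: bool = True
    use_lora: bool = False
    use_qlora: bool = False   # implies use_lora=True in the pipeline
    lora_rank: int = 64
    lora_alpha: float = 16.0
    lora_dropout: float = 0.1

args = FlatArguments(use_lora=True, lora_rank=16)
```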

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/finetune.py
  • Lines: L494-599 (model loading and LoRA configuration block)

Signature

This is not a standalone function but a code block within main(). The key calls are:

# Standard loading
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    revision=args.model_revision,
    from_tf=bool(".ckpt" in args.model_name_or_path),
    config=config,
    trust_remote_code=tc.trust_remote_code,
    low_cpu_mem_usage=args.low_cpu_mem_usage,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
)

# QLoRA loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    quantization_config=bnb_config,
    device_map={"": device_index},
    ...
)

# LoRA wrapping
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=args.lora_rank,
    lora_alpha=args.lora_alpha,
    lora_dropout=args.lora_dropout,
    target_modules=["q_proj", "o_proj", "v_proj", "k_proj",
                     "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)

Import

from transformers import AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| model_name_or_path | str | Yes | HuggingFace model ID or local path to the pre-trained model. |
| model_revision | str or None | No | Specific model version (branch, tag, or commit hash). |
| use_flash_attn | bool | No | Whether to use Flash Attention 2. Defaults to True. |
| use_lora | bool | No | Whether to apply LoRA adapters. Defaults to False. |
| use_qlora | bool | No | Whether to use 4-bit quantization with LoRA. Defaults to False. Implies use_lora=True. |
| lora_rank | int | No | Rank of the LoRA decomposition. Defaults to 64. |
| lora_alpha | float | No | LoRA scaling factor. Defaults to 16. |
| lora_dropout | float | No | Dropout rate for LoRA layers. Defaults to 0.1. |
| low_cpu_mem_usage | bool | No | Whether to use low CPU memory mode for loading. Defaults to False. |
| trust_remote_code | bool | No | Whether to trust remote code in the model definition. Defaults to False. |
| use_liger_kernel | bool | No | Whether to use Liger Kernel fused operations. Defaults to False. |

Outputs

| Name | Type | Description |
|------|------|-------------|
| model | AutoModelForCausalLM (or PeftModel) | The loaded model ready for training. If LoRA is enabled, this is a PeftModel wrapping the base model. |
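To give a feel for the lora_rank and lora_alpha values, a back-of-envelope count of adapter parameters: a LoRA pair on a d_out × d_in linear layer adds r·(d_in + d_out) trainable weights, and its update is scaled by lora_alpha / r. The 4096-wide projection below is an assumed Llama-3-8B-style shape, used only for illustration.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A has shape (r, d_in), B has (d_out, r)."""
    return r * d_in + d_out * r

# Example: a 4096x4096 projection (Llama-3-8B-style q_proj, assumed shape)
added = lora_params(4096, 4096, r=64)   # extra trainable weights for one layer
scaling = 16 / 64                       # lora_alpha / r
```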

Usage Examples

Basic Usage

from transformers import AutoModelForCausalLM
import torch

# Standard full fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Llama-3.1-Tulu-3-8B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

QLoRA Loading

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "allenai/Llama-3.1-Tulu-3-8B",
    quantization_config=bnb_config,
    device_map={"": 0},
    attn_implementation="flash_attention_2",
)

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "o_proj", "v_proj", "k_proj",
                     "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)
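A rough estimate of why the 4-bit path matters at this model scale, counting weight storage only (activations, optimizer state, and the NF4 quantization constants are ignored in this sketch):

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-storage footprint in GiB (weights only)."""
    return n_params * bits_per_param / 8 / 2**30

bf16_size = weight_gib(8e9, 16)  # an 8B model in bfloat16: roughly 15 GiB
nf4_size = weight_gib(8e9, 4)    # the same model in 4-bit NF4: under 4 GiB
```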

Dependencies

  • transformers -- provides AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
  • peft -- provides LoraConfig, get_peft_model, prepare_model_for_kbit_training
  • bitsandbytes -- required for QLoRA 4-bit quantization
  • flash-attn -- required when use_flash_attn=True
  • torch -- PyTorch for tensor operations and model execution

Related Pages

Implements Principle
