Implementation: AllenAI Open Instruct AutoModelForCausalLM.from_pretrained
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Deep Learning, Natural Language Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for loading a pre-trained causal language model with optional quantization and LoRA configuration, as used in the Open Instruct fine-tuning pipeline.
Description
This page documents the model-loading block in finetune.py, which wraps Hugging Face's AutoModelForCausalLM.from_pretrained() with Open Instruct's specific configuration patterns. The loading logic supports three modes:
- QLoRA mode: Loads the model with 4-bit NF4 quantization via bitsandbytes, then wraps it with LoRA adapters using PEFT.
- Liger Kernel mode: Uses AutoLigerKernelForCausalLM for fused linear cross-entropy optimization.
- Standard mode: Loads the model in bfloat16 with optional flash attention.
After loading, the code resizes token embeddings if the tokenizer vocabulary is larger than the model's embedding layer (padded to multiples of 8 for tensor core efficiency). If LoRA is enabled, the model is wrapped with get_peft_model() using a LoraConfig targeting the attention and MLP projection layers.
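The padding-to-a-multiple-of-8 step described above can be sketched as plain arithmetic (a minimal illustration; in transformers this is handled by `model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)`, and the example vocabulary size below is assumed, not taken from the source):

```python
import math

def padded_embedding_size(vocab_size: int, multiple: int = 8) -> int:
    # Round the embedding count up to the nearest multiple of 8
    # so the embedding matrix dimensions stay tensor-core friendly.
    return math.ceil(vocab_size / multiple) * multiple

# e.g. a tokenizer whose vocabulary grew to 128,258 tokens
print(padded_embedding_size(128_258))  # → 128264
print(padded_embedding_size(32_000))   # already a multiple of 8 → 32000
```

The resize is only triggered when the tokenizer vocabulary exceeds the model's current embedding count, so checkpoints whose embeddings already fit are left untouched.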
Usage
This loading pattern is invoked automatically by the main() function in finetune.py. External users configure it through the FlatArguments dataclass fields: model_name_or_path, use_flash_attn, use_lora, use_qlora, lora_rank, lora_alpha, lora_dropout.
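The configuration surface can be pictured as a small dataclass (a hedged sketch: field names come from this page and defaults mirror the I/O contract, but this is not the authoritative FlatArguments definition from open_instruct):

```python
from dataclasses import dataclass

@dataclass
class FlatArguments:
    # Illustrative subset of the fields listed above; defaults per the I/O contract.
    model_name_or_path: str = "allenai/Llama-3.1-Tulu-3-8B"
    use_flash_attn: bool = True
    use_lora: bool = False
    use_qlora: bool = False  # implies use_lora=True downstream
    lora_rank: int = 64
    lora_alpha: float = 16.0
    lora_dropout: float = 0.1

args = FlatArguments(use_qlora=True)
print(args.lora_rank)  # → 64
```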
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/finetune.py
- Lines: L494-599 (model loading and LoRA configuration block)
Signature
This is not a standalone function but a code block within main(). The key calls are:
# Standard loading
model = AutoModelForCausalLM.from_pretrained(
args.model_name_or_path,
revision=args.model_revision,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
trust_remote_code=tc.trust_remote_code,
low_cpu_mem_usage=args.low_cpu_mem_usage,
dtype=torch.bfloat16,
attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
)
# QLoRA loading
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
args.model_name_or_path,
quantization_config=bnb_config,
device_map={"": device_index},
...
)
# LoRA wrapping
peft_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
inference_mode=False,
r=args.lora_rank,
lora_alpha=args.lora_alpha,
lora_dropout=args.lora_dropout,
target_modules=["q_proj", "o_proj", "v_proj", "k_proj",
"gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)
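The branching among the three modes can be summarized as a small helper (a simplified sketch of the control flow implied above; the exact precedence of the flags in finetune.py may differ):

```python
def select_loading_mode(use_qlora: bool, use_liger_kernel: bool) -> str:
    # Hypothetical helper: QLoRA and Liger Kernel are treated as
    # mutually exclusive here, with the standard bfloat16 path as fallback.
    if use_qlora:
        return "qlora"
    if use_liger_kernel:
        return "liger"
    return "standard"

print(select_loading_mode(use_qlora=True, use_liger_kernel=False))  # → qlora
```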
Import
from transformers import AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | Yes | HuggingFace model ID or local path to the pre-trained model. |
| model_revision | str or None | No | Specific model version (branch, tag, or commit hash). |
| use_flash_attn | bool | No | Whether to use Flash Attention 2. Defaults to True. |
| use_lora | bool | No | Whether to apply LoRA adapters. Defaults to False. |
| use_qlora | bool | No | Whether to use 4-bit quantization with LoRA. Defaults to False. Implies use_lora=True. |
| lora_rank | int | No | Rank of the LoRA decomposition. Defaults to 64. |
| lora_alpha | float | No | LoRA scaling factor. Defaults to 16. |
| lora_dropout | float | No | Dropout rate for LoRA layers. Defaults to 0.1. |
| low_cpu_mem_usage | bool | No | Whether to use low CPU memory mode for loading. Defaults to False. |
| trust_remote_code | bool | No | Whether to trust remote code in model definition. Defaults to False. |
| use_liger_kernel | bool | No | Whether to use LigerKernel fused operations. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
| model | AutoModelForCausalLM (or PeftModel) | The loaded model ready for training. If LoRA is enabled, this is a PeftModel wrapping the base model. |
Usage Examples
Basic Usage
from transformers import AutoModelForCausalLM
import torch
# Standard full fine-tuning
model = AutoModelForCausalLM.from_pretrained(
"allenai/Llama-3.1-Tulu-3-8B",
dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
QLoRA Loading
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
"allenai/Llama-3.1-Tulu-3-8B",
quantization_config=bnb_config,
device_map={"": 0},
attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
peft_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=64,
lora_alpha=16,
lora_dropout=0.1,
target_modules=["q_proj", "o_proj", "v_proj", "k_proj",
"gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)
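For intuition on why QLoRA trains so few parameters, each targeted projection gets a rank-r update factored as B @ A. A rough per-layer count (an illustrative sketch; the 4096-dimensional projections below are assumed for an 8B-class Llama, not taken from the source):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factors the weight update as B @ A,
    # where A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r

# e.g. a single 4096 -> 4096 q_proj with r=64
print(lora_param_count(4096, 4096, 64))  # → 524288
```

Compared to the 16.8M parameters of the frozen 4096x4096 base projection, the adapter adds about 3% extra trainable weight per layer, which is why the base model can stay 4-bit quantized.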
Dependencies
- transformers -- provides AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
- peft -- provides LoraConfig, get_peft_model, prepare_model_for_kbit_training
- bitsandbytes -- required for QLoRA 4-bit quantization
- flash-attn -- required when use_flash_attn=True
- torch -- PyTorch for tensor operations and model execution