
Implementation:Lm sys FastChat AutoModelForCausalLM From Pretrained SFT

From Leeroopedia


Field Value
Page Type Implementation (Wrapper Doc)
Title AutoModelForCausalLM From Pretrained SFT
Repository lm-sys/FastChat
Workflow Vicuna SFT Finetuning
Domains Model Loading, Transformer Architecture, Tokenizer Configuration
Knowledge Sources fastchat/train/train.py, Hugging Face Transformers AutoModel documentation
Last Updated 2026-02-07 14:00 GMT

Overview

This implementation documents how AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained are used within the Vicuna SFT training script to load a pre-trained causal language model and its associated tokenizer. The wrapper also includes RoPE scaling logic for context window extension and cache disabling for training efficiency.

Description

The train() function in fastchat/train/train.py orchestrates model and tokenizer loading in several stages:

Stage 1: Configuration Loading and RoPE Scaling (Lines 265-275)

Before loading model weights, the model configuration is loaded and potentially modified:

# Set RoPE scaling factor
config = transformers.AutoConfig.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    trust_remote_code=model_args.trust_remote_code,
)
orig_ctx_len = getattr(config, "max_position_embeddings", None)
if orig_ctx_len and training_args.model_max_length > orig_ctx_len:
    scaling_factor = float(math.ceil(training_args.model_max_length / orig_ctx_len))
    config.rope_scaling = {"type": "linear", "factor": scaling_factor}
config.use_cache = False

If the desired model_max_length exceeds the model's original max_position_embeddings, a linear RoPE scaling factor (rounded up to the next integer) is computed and injected into the configuration. use_cache is set to False unconditionally: the KV cache only accelerates autoregressive generation and would waste memory during training.
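As a quick illustration of the arithmetic above, the guard can be sketched as a standalone helper (rope_scaling_factor is a hypothetical name for this page, not a function in train.py):

```python
import math

def rope_scaling_factor(model_max_length, orig_ctx_len):
    """Mirror the guard above: return a linear RoPE factor, or None if no scaling is needed."""
    if orig_ctx_len and model_max_length > orig_ctx_len:
        return float(math.ceil(model_max_length / orig_ctx_len))
    return None

# Extending a 4096-token context window to 16384 tokens needs a factor of 4.0.
print(rope_scaling_factor(16384, 4096))  # 4.0
# Non-integer ratios are rounded up: 5000 / 4096 ~= 1.22 -> factor 2.0.
print(rope_scaling_factor(5000, 4096))   # 2.0
# No scaling when the target length fits within the original window.
print(rope_scaling_factor(2048, 4096))   # None
```

Note that rounding up means a request only slightly past the original window still doubles the scaling factor, trading some precision in positional resolution for a simple integer scheme.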

Stage 2: Model Loading (Lines 278-283)

The model is loaded with the modified configuration:

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    trust_remote_code=model_args.trust_remote_code,
)

Stage 3: Tokenizer Loading (Lines 284-294)

The tokenizer is loaded with training-specific settings:

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    model_max_length=training_args.model_max_length,
    padding_side=model_args.padding_side,
    use_fast=False,
    trust_remote_code=model_args.trust_remote_code,
)

if tokenizer.pad_token != tokenizer.unk_token:
    tokenizer.pad_token = tokenizer.unk_token
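The pad-token guard matters because Llama-family tokenizers typically ship without a pad_token; reusing unk_token gives batched training a padding id without growing the vocabulary. A minimal stand-in sketch of the guard, using SimpleNamespace so no real tokenizer needs to be downloaded:

```python
from types import SimpleNamespace

# Stand-in for a freshly loaded Llama/Vicuna-style tokenizer: unk_token
# exists, pad_token does not (this mirrors the common case, not the real API).
tokenizer = SimpleNamespace(pad_token=None, unk_token="<unk>")

# The guard from train.py: reuse unk_token as the padding token.
if tokenizer.pad_token != tokenizer.unk_token:
    tokenizer.pad_token = tokenizer.unk_token

print(tokenizer.pad_token)  # <unk>
```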

Usage

Code Reference

Source Location

fastchat/train/train.py:L265-294 (RoPE scaling at L265-275, model loading at L278-283, tokenizer loading at L284-294)

Signature

Model and tokenizer loading are not wrapped in a standalone function; the calls occur inline within train(). The signatures below are schematic: they annotate the keyword arguments passed at these call sites with their types, rather than reproducing the full Transformers signatures:

# Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path: str,
    config: transformers.PretrainedConfig,
    cache_dir: Optional[str],
    trust_remote_code: bool,
) -> transformers.PreTrainedModel

# Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name_or_path: str,
    cache_dir: Optional[str],
    model_max_length: int,
    padding_side: str,
    use_fast: bool,
    trust_remote_code: bool,
) -> transformers.PreTrainedTokenizer

Import

import transformers
# Used as:
# transformers.AutoModelForCausalLM.from_pretrained(...)
# transformers.AutoTokenizer.from_pretrained(...)

I/O Contract

Inputs (Key Parameters)

Parameter Type Default Description
model_name_or_path str "facebook/opt-125m" Hugging Face model identifier or local path to pre-trained model.
model_max_length int 512 Maximum sequence length for tokenization. Sequences are right-padded and possibly truncated to this length.
padding_side str "right" Side on which the tokenizer applies padding.
trust_remote_code bool False Whether to allow execution of custom model code from the Hub.
cache_dir Optional[str] None Directory for caching downloaded model files.
use_fast bool False (hardcoded) Use the slow (Python) tokenizer for compatibility.
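These defaults are supplied by the argument dataclasses in fastchat/train/train.py; a simplified sketch of the relevant fields (subset chosen for this page, not the full class definitions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelArguments:
    # Defaults as listed in the table above.
    model_name_or_path: str = "facebook/opt-125m"
    trust_remote_code: bool = False
    padding_side: str = "right"

@dataclass
class TrainingArguments:
    cache_dir: Optional[str] = None
    model_max_length: int = 512

m, t = ModelArguments(), TrainingArguments()
print(m.model_name_or_path, t.model_max_length)  # facebook/opt-125m 512
```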

Outputs

Output Type Description
model transformers.PreTrainedModel The loaded causal LM with RoPE scaling applied (if needed) and use_cache=False.
tokenizer transformers.PreTrainedTokenizer The configured tokenizer with pad_token set to unk_token, right padding, and the specified model_max_length.

Usage Examples

Loading a Vicuna model for SFT:

import math
import transformers

model_name = "lmsys/vicuna-7b-v1.5"
model_max_length = 2048

# Step 1: Load config with RoPE scaling
config = transformers.AutoConfig.from_pretrained(model_name)
orig_ctx_len = getattr(config, "max_position_embeddings", None)
if orig_ctx_len and model_max_length > orig_ctx_len:
    scaling_factor = float(math.ceil(model_max_length / orig_ctx_len))
    config.rope_scaling = {"type": "linear", "factor": scaling_factor}
config.use_cache = False

# Step 2: Load model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name, config=config
)

# Step 3: Load tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name,
    model_max_length=model_max_length,
    padding_side="right",
    use_fast=False,
)
if tokenizer.pad_token != tokenizer.unk_token:
    tokenizer.pad_token = tokenizer.unk_token
