Implementation: lm-sys/FastChat AutoModelForCausalLM From Pretrained SFT
| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Title | AutoModelForCausalLM From Pretrained SFT |
| Repository | lm-sys/FastChat |
| Workflow | Vicuna SFT Finetuning |
| Domains | Model Loading, Transformer Architecture, Tokenizer Configuration |
| Knowledge Sources | fastchat/train/train.py, Hugging Face Transformers AutoModel documentation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This implementation documents how AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained are used within the Vicuna SFT training script to load a pre-trained causal language model and its associated tokenizer. The wrapper also includes RoPE scaling logic for context window extension and cache disabling for training efficiency.
Description
The train() function in fastchat/train/train.py orchestrates model and tokenizer loading in several stages:
Stage 1: Configuration Loading and RoPE Scaling (Lines 265-275)
Before loading model weights, the model configuration is loaded and potentially modified:
```python
# Set RoPE scaling factor
config = transformers.AutoConfig.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    trust_remote_code=model_args.trust_remote_code,
)
orig_ctx_len = getattr(config, "max_position_embeddings", None)
if orig_ctx_len and training_args.model_max_length > orig_ctx_len:
    scaling_factor = float(math.ceil(training_args.model_max_length / orig_ctx_len))
    config.rope_scaling = {"type": "linear", "factor": scaling_factor}
config.use_cache = False
```
If the desired model_max_length exceeds the model's original max_position_embeddings, a linear RoPE scaling factor is computed (rounded up to the next integer by math.ceil) and injected into the configuration. The KV cache is disabled unconditionally, since cached key/value states only speed up autoregressive generation and are unused during training.
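A quick standalone sketch of the scaling-factor arithmetic (the helper function name is illustrative, not from the training script):

```python
import math

def linear_rope_factor(model_max_length: int, orig_ctx_len: int) -> float:
    """Compute the linear RoPE scaling factor the script would inject."""
    return float(math.ceil(model_max_length / orig_ctx_len))

# Extending a 2048-token context model:
print(linear_rope_factor(4096, 2048))  # -> 2.0
print(linear_rope_factor(3000, 2048))  # ceil rounds up, so also 2.0
```

Because of the ceil, the factor is always a whole number, even when the target length is not an exact multiple of the original context length.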
Stage 2: Model Loading (Lines 278-283)
The model is loaded with the modified configuration:
```python
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    trust_remote_code=model_args.trust_remote_code,
)
```
Stage 3: Tokenizer Loading (Lines 284-294)
The tokenizer is loaded with training-specific settings:
```python
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    model_max_length=training_args.model_max_length,
    padding_side=model_args.padding_side,
    use_fast=False,
    trust_remote_code=model_args.trust_remote_code,
)
if tokenizer.pad_token != tokenizer.unk_token:
    tokenizer.pad_token = tokenizer.unk_token
```
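Reusing unk_token as the pad token means no new token is added to the vocabulary, so the model's embedding matrix never needs resizing. A minimal sketch of the idea, using a hypothetical toy tokenizer rather than the Transformers API:

```python
class ToyTokenizer:
    """Hypothetical stand-in illustrating the pad-token fallback."""
    def __init__(self):
        self.vocab = {"<unk>": 0, "hello": 1, "world": 2}
        self.unk_token = "<unk>"
        self.pad_token = None  # many causal-LM tokenizers ship without one

    def add_token(self, token):
        # Adding a genuinely new token would grow the vocab and force
        # an embedding-matrix resize on the model side.
        self.vocab[token] = len(self.vocab)

tok = ToyTokenizer()
if tok.pad_token != tok.unk_token:
    tok.pad_token = tok.unk_token  # reuse an existing id: vocab unchanged
print(len(tok.vocab))  # -> 3
```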
Usage
Code Reference
Source Location
fastchat/train/train.py:L265-294 (RoPE scaling at L265-275, model loading at L278-283, tokenizer loading at L284-294)
Signature
The model and tokenizer loading are not wrapped in a standalone function; they occur inline within the train() function. The underlying API calls are:
```
# Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path: str,
    config: transformers.PretrainedConfig,
    cache_dir: Optional[str],
    trust_remote_code: bool,
) -> transformers.PreTrainedModel

# Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name_or_path: str,
    cache_dir: Optional[str],
    model_max_length: int,
    padding_side: str,
    use_fast: bool,
    trust_remote_code: bool,
) -> transformers.PreTrainedTokenizer
```
Import
```python
import transformers

# Used as:
#   transformers.AutoModelForCausalLM.from_pretrained(...)
#   transformers.AutoTokenizer.from_pretrained(...)
```
I/O Contract
Inputs (Key Parameters)
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name_or_path | str | "facebook/opt-125m" | Hugging Face model identifier or local path to a pre-trained model. |
| model_max_length | int | 512 | Maximum sequence length for tokenization. Sequences are right-padded and possibly truncated to this length. |
| padding_side | str | "right" | Side on which the tokenizer applies padding. |
| trust_remote_code | bool | False | Whether to allow execution of custom model code from the Hub. |
| cache_dir | Optional[str] | None | Directory for caching downloaded model files. |
| use_fast | bool | False (hardcoded) | Use the slow (Python) tokenizer for compatibility. |
Outputs
| Output | Type | Description |
|---|---|---|
| model | transformers.PreTrainedModel | The loaded causal LM with RoPE scaling applied (if needed) and use_cache=False. |
| tokenizer | transformers.PreTrainedTokenizer | The configured tokenizer with pad_token set to unk_token, right padding, and the specified model_max_length. |
Usage Examples
Loading a Vicuna model for SFT:
```python
import math

import transformers

model_name = "lmsys/vicuna-7b-v1.5"
model_max_length = 2048

# Step 1: Load config and apply RoPE scaling if needed
config = transformers.AutoConfig.from_pretrained(model_name)
orig_ctx_len = getattr(config, "max_position_embeddings", None)
if orig_ctx_len and model_max_length > orig_ctx_len:
    scaling_factor = float(math.ceil(model_max_length / orig_ctx_len))
    config.rope_scaling = {"type": "linear", "factor": scaling_factor}
config.use_cache = False

# Step 2: Load model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name, config=config
)

# Step 3: Load tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name,
    model_max_length=model_max_length,
    padding_side="right",
    use_fast=False,
)
if tokenizer.pad_token != tokenizer.unk_token:
    tokenizer.pad_token = tokenizer.unk_token
```
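The right-padding and truncation behavior these settings request can be sketched without loading any model. The list-of-ids helper below is a toy stand-in for illustration, not the Transformers API:

```python
def pad_batch(batches, pad_id, max_length):
    """Right-pad each id sequence to max_length, truncating longer ones."""
    out = []
    for ids in batches:
        ids = ids[:max_length]  # truncate to model_max_length
        out.append(ids + [pad_id] * (max_length - len(ids)))  # pad on the right
    return out

padded = pad_batch([[5, 6, 7], [8]], pad_id=0, max_length=4)
print(padded)  # -> [[5, 6, 7, 0], [8, 0, 0, 0]]
```

Right padding suits causal-LM training because loss on pad positions can simply be masked at the end of each sequence; left padding is typically preferred only for batched generation.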
External References
- Hugging Face AutoModelForCausalLM documentation: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM