Implementation: lm-sys/FastChat AutoModelForCausalLM From Pretrained SFT
| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Title | AutoModelForCausalLM From Pretrained SFT |
| Repository | lm-sys/FastChat |
| Workflow | Vicuna SFT Finetuning |
| Domains | Model Loading, Transformer Architecture, Tokenizer Configuration |
| Knowledge Sources | fastchat/train/train.py, Hugging Face Transformers AutoModel documentation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This implementation documents how AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained are used within the Vicuna SFT training script to load a pre-trained causal language model and its associated tokenizer. The wrapper also includes RoPE scaling logic for context window extension and cache disabling for training efficiency.
Description
The train() function in fastchat/train/train.py orchestrates model and tokenizer loading in several stages:
Stage 1: Configuration Loading and RoPE Scaling (Lines 265-275)
Before loading model weights, the model configuration is loaded and potentially modified:
```python
# Set RoPE scaling factor
config = transformers.AutoConfig.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    trust_remote_code=model_args.trust_remote_code,
)
orig_ctx_len = getattr(config, "max_position_embeddings", None)
if orig_ctx_len and training_args.model_max_length > orig_ctx_len:
    scaling_factor = float(math.ceil(training_args.model_max_length / orig_ctx_len))
    config.rope_scaling = {"type": "linear", "factor": scaling_factor}
config.use_cache = False
```
If the desired model_max_length exceeds the model's original max_position_embeddings, a linear RoPE scaling factor is computed (rounded up to the next integer by math.ceil) and injected into the configuration. The KV cache is disabled unconditionally, since cached key/value states only speed up autoregressive generation and are unused during training.
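A quick standalone sketch of the scaling-factor arithmetic (the helper function name is illustrative, not from the training script):

```python
import math

def linear_rope_factor(model_max_length: int, orig_ctx_len: int) -> float:
    """Compute the linear RoPE scaling factor the script would inject."""
    return float(math.ceil(model_max_length / orig_ctx_len))

# Extending a 2048-token context model:
print(linear_rope_factor(4096, 2048))  # -> 2.0
print(linear_rope_factor(3000, 2048))  # ceil rounds up, so also 2.0
```

Because of the ceil, the factor is always a whole number, even when the target length is not an exact multiple of the original context length.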
Stage 2: Model Loading (Lines 278-283)
The model is loaded with the modified configuration:
```python
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    trust_remote_code=model_args.trust_remote_code,
)
```
Stage 3: Tokenizer Loading (Lines 284-294)
The tokenizer is loaded with training-specific settings:
```python
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    model_max_length=training_args.model_max_length,
    padding_side=model_args.padding_side,
    use_fast=False,
    trust_remote_code=model_args.trust_remote_code,
)
if tokenizer.pad_token != tokenizer.unk_token:
    tokenizer.pad_token = tokenizer.unk_token
```
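Reusing unk_token as the pad token means no new token is added to the vocabulary, so the model's embedding matrix never needs resizing. A minimal sketch of the idea, using a hypothetical toy tokenizer rather than the Transformers API:

```python
class ToyTokenizer:
    """Hypothetical stand-in illustrating the pad-token fallback."""
    def __init__(self):
        self.vocab = {"<unk>": 0, "hello": 1, "world": 2}
        self.unk_token = "<unk>"
        self.pad_token = None  # many causal-LM tokenizers ship without one

    def add_token(self, token):
        # Adding a genuinely new token would grow the vocab and force
        # an embedding-matrix resize on the model side.
        self.vocab[token] = len(self.vocab)

tok = ToyTokenizer()
if tok.pad_token != tok.unk_token:
    tok.pad_token = tok.unk_token  # reuse an existing id: vocab unchanged
print(len(tok.vocab))  # -> 3
```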
Usage
Code Reference
Source Location
fastchat/train/train.py:L265-294 (RoPE scaling at L265-275, model loading at L278-283, tokenizer loading at L284-294)
Signature
The model and tokenizer loading are not wrapped in a standalone function; they occur inline within the train() function. The underlying API calls are:
```
# Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path: str,
    config: transformers.PretrainedConfig,
    cache_dir: Optional[str],
    trust_remote_code: bool,
) -> transformers.PreTrainedModel

# Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name_or_path: str,
    cache_dir: Optional[str],
    model_max_length: int,
    padding_side: str,
    use_fast: bool,
    trust_remote_code: bool,
) -> transformers.PreTrainedTokenizer
```
Import
```python
import transformers

# Used as:
#   transformers.AutoModelForCausalLM.from_pretrained(...)
#   transformers.AutoTokenizer.from_pretrained(...)
```
I/O Contract
Inputs (Key Parameters)
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name_or_path | str | "facebook/opt-125m" | Hugging Face model identifier or local path to a pre-trained model. |
| model_max_length | int | 512 | Maximum sequence length for tokenization. Sequences are right-padded and possibly truncated to this length. |
| padding_side | str | "right" | Side on which the tokenizer applies padding. |
| trust_remote_code | bool | False | Whether to allow execution of custom model code from the Hub. |
| cache_dir | Optional[str] | None | Directory for caching downloaded model files. |
| use_fast | bool | False (hardcoded) | Use the slow (Python) tokenizer for compatibility. |
Outputs
| Output | Type | Description |
|---|---|---|
| model | transformers.PreTrainedModel | The loaded causal LM with RoPE scaling applied (if needed) and use_cache=False. |
| tokenizer | transformers.PreTrainedTokenizer | The configured tokenizer with pad_token set to unk_token, right padding, and the specified model_max_length. |
Usage Examples
Loading a Vicuna model for SFT:
```python
import math

import transformers

model_name = "lmsys/vicuna-7b-v1.5"
model_max_length = 2048

# Step 1: Load config and apply RoPE scaling if needed
config = transformers.AutoConfig.from_pretrained(model_name)
orig_ctx_len = getattr(config, "max_position_embeddings", None)
if orig_ctx_len and model_max_length > orig_ctx_len:
    scaling_factor = float(math.ceil(model_max_length / orig_ctx_len))
    config.rope_scaling = {"type": "linear", "factor": scaling_factor}
config.use_cache = False

# Step 2: Load model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name, config=config
)

# Step 3: Load tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name,
    model_max_length=model_max_length,
    padding_side="right",
    use_fast=False,
)
if tokenizer.pad_token != tokenizer.unk_token:
    tokenizer.pad_token = tokenizer.unk_token
```
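The right-padding and truncation behavior these settings request can be sketched without loading any model. The list-of-ids helper below is a toy stand-in for illustration, not the Transformers API:

```python
def pad_batch(batches, pad_id, max_length):
    """Right-pad each id sequence to max_length, truncating longer ones."""
    out = []
    for ids in batches:
        ids = ids[:max_length]  # truncate to model_max_length
        out.append(ids + [pad_id] * (max_length - len(ids)))  # pad on the right
    return out

padded = pad_batch([[5, 6, 7], [8]], pad_id=0, max_length=4)
print(padded)  # -> [[5, 6, 7, 0], [8, 0, 0, 0]]
```

Right padding suits causal-LM training because loss on pad positions can simply be masked at the end of each sequence; left padding is typically preferred only for batched generation.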
External References
- Hugging Face AutoModelForCausalLM documentation: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM