
Principle: lm-sys/FastChat Causal LM Loading

From Leeroopedia


Page Type: Principle
Title: Causal LM Loading
Repository: lm-sys/FastChat
Workflow: Vicuna SFT Finetuning
Domains: Model Loading, Transformer Architecture, Tokenizer Configuration
Knowledge Sources: fastchat/train/train.py, Hugging Face Transformers documentation, RoPE scaling literature
Last Updated: 2026-02-07 14:00 GMT

Overview

This principle covers the theory and considerations involved in loading pre-trained causal language models for supervised fine-tuning. It addresses model architecture loading, tokenizer configuration, Rotary Position Embedding (RoPE) scaling for extended context windows, and cache management during training.

Description

Causal Language Model Architecture

A causal language model (also called an autoregressive or decoder-only model) generates text left-to-right, predicting each token based only on preceding tokens. When loading such a model for fine-tuning:

  • The pre-trained weights encode the model's learned knowledge from its original training corpus.
  • The model configuration specifies architectural parameters (hidden size, number of layers, number of attention heads, vocabulary size, maximum position embeddings).
  • The architecture class (e.g., LlamaForCausalLM, OPTForCausalLM) is automatically resolved via the AutoModelForCausalLM registry.

Tokenizer Configuration

The tokenizer must be configured consistently with the model and the training requirements:

  • Padding side: For causal LMs used in training, right padding is the standard choice. This ensures that the actual content tokens are left-aligned and that padding tokens appear at the end of the sequence, which is compatible with causal attention masks.
  • Pad token: Many pre-trained models (e.g., LLaMA) do not define a pad token by default. A common practice is to set the pad token equal to the unknown token (unk_token), since the unknown token is rarely used in well-tokenized training data.
  • Fast vs. slow tokenizers: Some models require the slow (Python-based) tokenizer for correct behavior, particularly when legacy tokenization modes affect special token handling.
  • Model max length: The tokenizer's model_max_length must match the desired training sequence length, which may exceed the model's original pre-training context length.

RoPE Scaling for Extended Context

Rotary Position Embeddings (RoPE) encode positional information by rotating query and key vectors in attention. When the desired training sequence length exceeds the model's original context length (max_position_embeddings):

  • Linear scaling (positional interpolation) extends the effective context window by dividing position indices by a scaling factor, equivalently scaling down the RoPE rotation frequencies.
  • The scaling factor is computed as the ceiling of the ratio between the desired length and the original context length: factor = ceil(model_max_length / orig_ctx_len).
  • This approach allows fine-tuning at longer sequence lengths without retraining position embeddings from scratch.
  • The scaling configuration is injected into the model config before weight loading, so the model architecture is constructed with the extended positional encoding.
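A minimal sketch of this config injection (the `apply_rope_scaling` helper is hypothetical; the `rope_scaling` dict follows the Hugging Face LLaMA config format):

```python
import math

def apply_rope_scaling(config, model_max_length):
    """Inject linear RoPE scaling into a model config before weight loading.

    `config` is any object exposing the Hugging Face config attribute
    `max_position_embeddings`; this helper itself is illustrative.
    """
    orig_ctx_len = getattr(config, "max_position_embeddings", None)
    if orig_ctx_len and model_max_length > orig_ctx_len:
        # factor = ceil(model_max_length / orig_ctx_len), as described above
        factor = math.ceil(model_max_length / orig_ctx_len)
        config.rope_scaling = {"type": "linear", "factor": float(factor)}
    return config
```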

Cache Management During Training

Transformer models typically use a key-value cache to speed up autoregressive generation by avoiding redundant computation. During training:

  • The KV cache is disabled (use_cache=False) because training processes all tokens simultaneously in a single forward pass, not autoregressively.
  • Disabling the cache reduces memory usage during training and avoids unnecessary computation.
  • The cache is re-enabled after training completes, before saving the model, so that the saved checkpoint supports efficient inference.
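The cache lifecycle can be sketched as follows (a minimal illustration on a config-like object; the `train_then_save` function and its placeholder training step are hypothetical):

```python
def train_then_save(config):
    """Illustrative use_cache lifecycle around a training run.

    `config` mimics a Hugging Face model config with a `use_cache` flag.
    """
    # Training processes all tokens of a sequence in one parallel forward
    # pass, so the autoregressive KV cache is pure overhead here.
    config.use_cache = False
    # ... training loop would run here ...
    # Re-enable before saving so the checkpoint supports efficient inference.
    config.use_cache = True
    return config
```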

Usage

When loading a causal LM for SFT in the Vicuna pipeline:

  1. Load the model configuration from the pre-trained checkpoint, applying any RoPE scaling if needed.
  2. Set use_cache=False in the config before constructing the model.
  3. Load the model weights using AutoModelForCausalLM.from_pretrained with the modified config.
  4. Load the tokenizer with matching settings: padding_side="right", use_fast=False, and the correct model_max_length.
  5. Set pad_token = unk_token if they differ.
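The five steps above can be sketched with the Hugging Face Auto classes. This is a minimal illustration rather than FastChat's exact code; the `load_for_sft` and `rope_factor` helper names are hypothetical, and the example assumes a LLaMA-style checkpoint:

```python
import math

def rope_factor(model_max_length, orig_ctx_len):
    """Ceiling ratio used for linear RoPE scaling (step 1)."""
    return math.ceil(model_max_length / orig_ctx_len)

def load_for_sft(model_name_or_path, model_max_length):
    """Load a causal LM and tokenizer configured for SFT (sketch)."""
    # transformers is imported lazily so rope_factor stays dependency-free
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    # Step 1: load the config and apply RoPE scaling if needed.
    config = AutoConfig.from_pretrained(model_name_or_path)
    orig_ctx_len = getattr(config, "max_position_embeddings", None)
    if orig_ctx_len and model_max_length > orig_ctx_len:
        config.rope_scaling = {
            "type": "linear",
            "factor": float(rope_factor(model_max_length, orig_ctx_len)),
        }
    # Step 2: disable the KV cache for training.
    config.use_cache = False
    # Step 3: load weights with the modified config.
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, config=config)
    # Step 4: load the tokenizer with matching settings.
    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path,
        model_max_length=model_max_length,
        padding_side="right",
        use_fast=False,
    )
    # Step 5: fall back to unk_token when no pad token is defined.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.unk_token
    return model, tokenizer
```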

Theoretical Basis

The practice of loading pre-trained weights for fine-tuning is grounded in transfer learning theory: a model trained on a large, general corpus captures broadly useful representations that can be efficiently adapted to specific tasks with relatively little additional data and compute.

RoPE scaling leverages the mathematical structure of rotary embeddings. Since RoPE encodes position through rotation angles that grow linearly with the position index, dividing position indices by a scaling factor (equivalently, scaling down the rotation frequencies) compresses longer sequences into the angular range the model saw during pre-training. This is supported by empirical findings (Chen et al., "Extending Context Window of Large Language Models via Positional Interpolation," 2023) showing that linear interpolation of RoPE positions enables context extension with minimal fine-tuning.
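The interpolation effect can be checked numerically (a small sketch; the `rope_angle` function and its parameter names are illustrative, with the standard RoPE base of 10000):

```python
def rope_angle(pos, dim_pair, head_dim, base=10000.0, factor=1.0):
    """Rotation angle for position `pos` at frequency index `dim_pair`.

    Linear RoPE scaling divides the position index by `factor`, so a
    position beyond the original context maps to an angle the model
    already saw during pre-training.
    """
    inv_freq = 1.0 / (base ** (2 * dim_pair / head_dim))
    return (pos / factor) * inv_freq
```

With factor 2, position 8190 in the extended context produces the same rotation angle that position 4095 produced in the original context, so the fine-tuned model reuses familiar positional geometry.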

Disabling the KV cache during training follows from the fundamental difference between training (full-sequence parallel processing) and inference (sequential token generation). The cache is an optimization for the latter and introduces unnecessary overhead during the former.
