Implementation:Lm sys FastChat Preprocess Conversation

From Leeroopedia


Field Value
Page Type Implementation (API Doc)
Title Preprocess Conversation
Repository lm-sys/FastChat
Workflow Vicuna SFT Finetuning
Domains NLP Preprocessing, Tokenization, Loss Masking
Knowledge Sources fastchat/train/train.py
Last Updated 2026-02-07 14:00 GMT

Overview

This page documents the preprocess function and the associated SupervisedDataset and LazySupervisedDataset classes. Together, these components transform raw ShareGPT conversations into tokenized, target-masked tensors ready for supervised fine-tuning. The preprocess function handles prompt template application, tokenization, and target masking, while the dataset classes provide PyTorch Dataset interfaces for training.

Description

The preprocess Function

The preprocess function performs three major operations:

  1. Prompt template application: Uses get_conversation_template("vicuna") to obtain the Vicuna conversation template, maps raw roles ("human", "gpt") to template roles ("USER", "ASSISTANT"), and generates formatted prompt strings.
  2. Tokenization: Tokenizes all conversations with the provided tokenizer using padding="max_length" and truncation, producing input_ids tensors, which are then cloned to serve as the initial targets.
  3. Target masking: Iterates through each conversation, splitting on sep2 (typically "</s>", the Vicuna template's turn separator) to identify turns. For each turn, it locates the user instruction portion and masks it with IGNORE_TOKEN_ID (-100). The BOS token at position 0 and any padding tokens beyond the conversation content are masked as well (see the sketch after this list).
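
The loop below is a condensed sketch of the target-masking step, adapted from the logic in fastchat/train/train.py; mask_targets is a hypothetical wrapper name, and the legacy-tokenizer offset adjustments in the source are omitted for brevity.

from transformers.trainer_pt_utils import LabelSmoother

IGNORE_TOKEN_ID = LabelSmoother.ignore_index  # -100

def mask_targets(conversation, target, tokenizer, conv):
    # conversation: the fully formatted prompt string for one dialogue
    # target: a clone of its input_ids (1-D tensor), modified in place
    # conv: the template from get_conversation_template("vicuna")
    sep = conv.sep + conv.roles[1] + ": "   # " ASSISTANT: " for Vicuna
    turns = conversation.split(conv.sep2)   # split turns on "</s>"
    cur_len = 1
    target[:cur_len] = IGNORE_TOKEN_ID      # mask the BOS token at position 0
    for turn in turns:
        if turn == "":
            break
        turn_len = len(tokenizer(turn).input_ids)
        parts = turn.split(sep)
        if len(parts) != 2:
            break
        parts[0] += sep
        # "-2" compensates for the BOS token and a tokenizer offset,
        # mirroring the hardcoded adjustment in the source
        instruction_len = len(tokenizer(parts[0]).input_ids) - 2
        target[cur_len : cur_len + instruction_len] = IGNORE_TOKEN_ID
        cur_len += turn_len
    target[cur_len:] = IGNORE_TOKEN_ID      # mask padding beyond the content
    return target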

SupervisedDataset

An eager dataset that preprocesses all data at initialization:

  • Calls preprocess on all conversations at once during __init__.
  • Stores input_ids, labels, and attention_mask tensors as instance attributes.
  • __getitem__ returns a dictionary of tensors for a single index.
  • Memory-intensive but fast per-sample access.
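
A minimal sketch of the eager pattern, condensed from the source (the class name EagerDatasetSketch is hypothetical, and the source's progress logging is omitted):

from torch.utils.data import Dataset
from fastchat.train.train import preprocess

class EagerDatasetSketch(Dataset):
    """Condensed sketch of SupervisedDataset's eager preprocessing."""

    def __init__(self, raw_data, tokenizer):
        # Tokenize and mask every conversation up front.
        sources = [example["conversations"] for example in raw_data]
        data_dict = preprocess(sources, tokenizer)
        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]
        self.attention_mask = data_dict["attention_mask"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i):
        # Per-sample access is just tensor indexing.
        return dict(
            input_ids=self.input_ids[i],
            labels=self.labels[i],
            attention_mask=self.attention_mask[i],
        )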

LazySupervisedDataset

A lazy dataset that preprocesses data on demand:

  • Stores raw data and tokenizer at initialization; does not tokenize.
  • On __getitem__, checks a cached_data_dict for previously processed items.
  • If not cached, calls preprocess for a single conversation, caches and returns the result.
  • Lower startup cost; memory grows as samples are accessed.
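
The per-item path looks roughly like the following, condensed from the source (self.cached_data_dict is an initially empty dict created in __init__):

def __getitem__(self, i):
    # Serve a previously processed item straight from the cache.
    if i in self.cached_data_dict:
        return self.cached_data_dict[i]
    # Otherwise tokenize and mask just this one conversation.
    ret = preprocess([self.raw_data[i]["conversations"]], self.tokenizer)
    ret = dict(
        input_ids=ret["input_ids"][0],
        labels=ret["labels"][0],
        attention_mask=ret["attention_mask"][0],
    )
    self.cached_data_dict[i] = ret
    return ret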

Usage

Code Reference

Source Location

fastchat/train/train.py:L92-253

  • preprocess: Lines 92-177
  • SupervisedDataset: Lines 180-202
  • LazySupervisedDataset: Lines 205-232

Signature

def preprocess(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    ...
class SupervisedDataset(Dataset):
    def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer):
        ...
    def __len__(self) -> int:
        ...
    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        ...
class LazySupervisedDataset(Dataset):
    def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer):
        ...
    def __len__(self) -> int:
        ...
    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        ...

Import

from fastchat.train.train import preprocess, SupervisedDataset, LazySupervisedDataset

I/O Contract

preprocess Inputs

Parameter Type Required Description
sources list[list[dict]] Yes A list of conversations, where each conversation is a list of turn dictionaries with "from" (either "human" or "gpt") and "value" (text content) keys.
tokenizer transformers.PreTrainedTokenizer Yes A configured tokenizer with model_max_length and pad_token set.

preprocess Outputs

Key Type Description
"input_ids" torch.Tensor (shape: [batch, seq_len]) Tokenized input IDs, padded to model_max_length.
"labels" torch.Tensor (shape: [batch, seq_len]) Target labels with user turns masked as IGNORE_TOKEN_ID (-100). Only assistant outputs have valid token IDs.
"attention_mask" torch.Tensor (shape: [batch, seq_len]) Boolean mask where True indicates non-padding positions.

SupervisedDataset / LazySupervisedDataset Inputs

Parameter Type Required Description
raw_data list[dict] Yes List of conversation dictionaries, each with a "conversations" key containing the list of turns.
tokenizer transformers.PreTrainedTokenizer Yes A configured tokenizer instance.

SupervisedDataset / LazySupervisedDataset Outputs (per item)

Key Type Description
"input_ids" torch.Tensor (shape: [seq_len]) Tokenized input IDs for a single conversation.
"labels" torch.Tensor (shape: [seq_len]) Target labels with user turns masked.
"attention_mask" torch.Tensor (shape: [seq_len]) Boolean attention mask for a single conversation.

Usage Examples

Using the preprocess function directly:

from fastchat.train.train import preprocess
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "lmsys/vicuna-7b-v1.5",
    model_max_length=2048,
    padding_side="right",
    use_fast=False,
)
tokenizer.pad_token = tokenizer.unk_token

# Single conversation with two turns
sources = [
    [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ]
]

result = preprocess(sources, tokenizer)
print(result["input_ids"].shape)   # torch.Size([1, 2048])
print(result["labels"].shape)      # torch.Size([1, 2048])

# Verify masking: user turn tokens should be -100
print((result["labels"][0] == -100).sum().item(), "tokens masked")
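
To inspect what remains supervised, decode only the unmasked label positions; under the setup above this should recover the assistant's reply (plus its end-of-sequence token) rather than the user instruction:

# Keep only positions whose label is not IGNORE_TOKEN_ID (-100)
valid = result["labels"][0] != -100
print(tokenizer.decode(result["input_ids"][0][valid]))
# Expected to contain: "The capital of France is Paris."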

Using SupervisedDataset:

from fastchat.train.train import SupervisedDataset

raw_data = [
    {
        "id": "conv_001",
        "conversations": [
            {"from": "human", "value": "Hello!"},
            {"from": "gpt", "value": "Hi there! How can I help you?"},
        ]
    },
    {
        "id": "conv_002",
        "conversations": [
            {"from": "human", "value": "Explain gravity."},
            {"from": "gpt", "value": "Gravity is a fundamental force..."},
        ]
    },
]

dataset = SupervisedDataset(raw_data, tokenizer)
print(f"Dataset size: {len(dataset)}")

sample = dataset[0]
print(f"input_ids: {sample['input_ids'].shape}")
print(f"labels: {sample['labels'].shape}")
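
Because every item is already padded to model_max_length, batching needs no custom collator. The FastChat training script hands these datasets to the Hugging Face Trainer; as a standalone sketch (an assumption about usage, not part of the source), transformers.default_data_collator simply stacks the fixed-length tensors:

from torch.utils.data import DataLoader
from transformers import default_data_collator

loader = DataLoader(dataset, batch_size=2, collate_fn=default_data_collator)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # torch.Size([2, 2048])
print(batch["labels"].shape)     # torch.Size([2, 2048])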

Key implementation detail -- IGNORE_TOKEN_ID:

from transformers.trainer_pt_utils import LabelSmoother

# IGNORE_TOKEN_ID is defined as:
IGNORE_TOKEN_ID = LabelSmoother.ignore_index  # equals -100
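
The choice of -100 is not arbitrary: it matches the default ignore_index of PyTorch's cross-entropy loss, so masked positions contribute nothing to the training objective. A minimal demonstration:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # 4 positions, vocab size 10
labels = torch.tensor([3, -100, 7, -100])   # two positions masked

# cross_entropy ignores label -100 by default, so the mean is taken
# over the two unmasked positions only.
loss = F.cross_entropy(logits, labels)
print(loss)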
