Implementation:Lm sys FastChat Preprocess Conversation

From Leeroopedia


Field Value
Page Type Implementation (API Doc)
Title Preprocess Conversation
Repository lm-sys/FastChat
Workflow Vicuna SFT Finetuning
Domains NLP Preprocessing, Tokenization, Loss Masking
Knowledge Sources fastchat/train/train.py
Last Updated 2026-02-07 14:00 GMT

Overview

This page documents the preprocess function and the associated SupervisedDataset and LazySupervisedDataset classes. Together, these components transform raw ShareGPT conversations into tokenized, target-masked tensors ready for supervised fine-tuning. The preprocess function handles prompt template application, tokenization, and target masking, while the dataset classes provide PyTorch Dataset interfaces for training.

Description

The preprocess Function

The preprocess function performs three major operations:

  1. Prompt template application: Uses get_conversation_template("vicuna") to obtain the Vicuna conversation template, maps raw roles ("human", "gpt") to template roles ("USER", "ASSISTANT"), and generates formatted prompt strings.
  2. Tokenization: Tokenizes all conversations with the provided tokenizer using padding="max_length" and truncation, producing input_ids tensors, which are then cloned to serve as the initial targets.
  3. Target masking: Iterates through each conversation, splitting on sep2 (typically "</s>", the Vicuna template's turn separator) to identify turns. For each turn, it locates the user instruction portion and masks it with IGNORE_TOKEN_ID (-100). The BOS token at position 0 and any padding tokens beyond the conversation content are masked as well (see the sketch after this list).
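
The loop below is a condensed sketch of the target-masking step, adapted from the logic in fastchat/train/train.py; mask_targets is a hypothetical wrapper name, and the legacy-tokenizer offset adjustments in the source are omitted for brevity.

from transformers.trainer_pt_utils import LabelSmoother

IGNORE_TOKEN_ID = LabelSmoother.ignore_index  # -100

def mask_targets(conversation, target, tokenizer, conv):
    # conversation: the fully formatted prompt string for one dialogue
    # target: a clone of its input_ids (1-D tensor), modified in place
    # conv: the template from get_conversation_template("vicuna")
    sep = conv.sep + conv.roles[1] + ": "   # " ASSISTANT: " for Vicuna
    turns = conversation.split(conv.sep2)   # split turns on "</s>"
    cur_len = 1
    target[:cur_len] = IGNORE_TOKEN_ID      # mask the BOS token at position 0
    for turn in turns:
        if turn == "":
            break
        turn_len = len(tokenizer(turn).input_ids)
        parts = turn.split(sep)
        if len(parts) != 2:
            break
        parts[0] += sep
        # "-2" compensates for the BOS token and a tokenizer offset,
        # mirroring the hardcoded adjustment in the source
        instruction_len = len(tokenizer(parts[0]).input_ids) - 2
        target[cur_len : cur_len + instruction_len] = IGNORE_TOKEN_ID
        cur_len += turn_len
    target[cur_len:] = IGNORE_TOKEN_ID      # mask padding beyond the content
    return target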

SupervisedDataset

An eager dataset that preprocesses all data at initialization:

  • Calls preprocess on all conversations at once during __init__.
  • Stores input_ids, labels, and attention_mask tensors as instance attributes.
  • __getitem__ returns a dictionary of tensors for a single index.
  • Memory-intensive but fast per-sample access.
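
A minimal sketch of the eager pattern, condensed from the source (the class name EagerDatasetSketch is hypothetical, and the source's progress logging is omitted):

from torch.utils.data import Dataset
from fastchat.train.train import preprocess

class EagerDatasetSketch(Dataset):
    """Condensed sketch of SupervisedDataset's eager preprocessing."""

    def __init__(self, raw_data, tokenizer):
        # Tokenize and mask every conversation up front.
        sources = [example["conversations"] for example in raw_data]
        data_dict = preprocess(sources, tokenizer)
        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]
        self.attention_mask = data_dict["attention_mask"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i):
        # Per-sample access is just tensor indexing.
        return dict(
            input_ids=self.input_ids[i],
            labels=self.labels[i],
            attention_mask=self.attention_mask[i],
        )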

LazySupervisedDataset

A lazy dataset that preprocesses data on demand:

  • Stores raw data and tokenizer at initialization; does not tokenize.
  • On __getitem__, checks a cached_data_dict for previously processed items.
  • If not cached, calls preprocess for a single conversation, caches and returns the result.
  • Lower startup cost; memory grows as samples are accessed.
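
The per-item path looks roughly like the following, condensed from the source (self.cached_data_dict is an initially empty dict created in __init__):

def __getitem__(self, i):
    # Serve a previously processed item straight from the cache.
    if i in self.cached_data_dict:
        return self.cached_data_dict[i]
    # Otherwise tokenize and mask just this one conversation.
    ret = preprocess([self.raw_data[i]["conversations"]], self.tokenizer)
    ret = dict(
        input_ids=ret["input_ids"][0],
        labels=ret["labels"][0],
        attention_mask=ret["attention_mask"][0],
    )
    self.cached_data_dict[i] = ret
    return ret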

Usage

Code Reference

Source Location

fastchat/train/train.py:L92-253

  • preprocess: Lines 92-177
  • SupervisedDataset: Lines 180-202
  • LazySupervisedDataset: Lines 205-232

Signature

def preprocess(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    ...
class SupervisedDataset(Dataset):
    def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer):
        ...
    def __len__(self) -> int:
        ...
    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        ...
class LazySupervisedDataset(Dataset):
    def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer):
        ...
    def __len__(self) -> int:
        ...
    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        ...

Import

from fastchat.train.train import preprocess, SupervisedDataset, LazySupervisedDataset

I/O Contract

preprocess Inputs

Parameter Type Required Description
sources list[list[dict]] Yes A list of conversations, where each conversation is a list of turn dictionaries with "from" (either "human" or "gpt") and "value" (text content) keys.
tokenizer transformers.PreTrainedTokenizer Yes A configured tokenizer with model_max_length and pad_token set.

preprocess Outputs

Key Type Description
"input_ids" torch.Tensor (shape: [batch, seq_len]) Tokenized input IDs, padded to model_max_length.
"labels" torch.Tensor (shape: [batch, seq_len]) Target labels with user turns masked as IGNORE_TOKEN_ID (-100). Only assistant outputs have valid token IDs.
"attention_mask" torch.Tensor (shape: [batch, seq_len]) Boolean mask where True indicates non-padding positions.

SupervisedDataset / LazySupervisedDataset Inputs

Parameter Type Required Description
raw_data list[dict] Yes List of conversation dictionaries, each with a "conversations" key containing the list of turns.
tokenizer transformers.PreTrainedTokenizer Yes A configured tokenizer instance.

SupervisedDataset / LazySupervisedDataset Outputs (per item)

Key Type Description
"input_ids" torch.Tensor (shape: [seq_len]) Tokenized input IDs for a single conversation.
"labels" torch.Tensor (shape: [seq_len]) Target labels with user turns masked.
"attention_mask" torch.Tensor (shape: [seq_len]) Boolean attention mask for a single conversation.

Usage Examples

Using the preprocess function directly:

from fastchat.train.train import preprocess
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "lmsys/vicuna-7b-v1.5",
    model_max_length=2048,
    padding_side="right",
    use_fast=False,
)
tokenizer.pad_token = tokenizer.unk_token

# Single conversation with two turns
sources = [
    [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ]
]

result = preprocess(sources, tokenizer)
print(result["input_ids"].shape)   # torch.Size([1, 2048])
print(result["labels"].shape)      # torch.Size([1, 2048])

# Verify masking: user turn tokens should be -100
print((result["labels"][0] == -100).sum().item(), "tokens masked")
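
To inspect what remains supervised, decode only the unmasked label positions; under the setup above this should recover the assistant's reply (plus its end-of-sequence token) rather than the user instruction:

# Keep only positions whose label is not IGNORE_TOKEN_ID (-100)
valid = result["labels"][0] != -100
print(tokenizer.decode(result["input_ids"][0][valid]))
# Expected to contain: "The capital of France is Paris."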

Using SupervisedDataset:

from fastchat.train.train import SupervisedDataset

raw_data = [
    {
        "id": "conv_001",
        "conversations": [
            {"from": "human", "value": "Hello!"},
            {"from": "gpt", "value": "Hi there! How can I help you?"},
        ]
    },
    {
        "id": "conv_002",
        "conversations": [
            {"from": "human", "value": "Explain gravity."},
            {"from": "gpt", "value": "Gravity is a fundamental force..."},
        ]
    },
]

dataset = SupervisedDataset(raw_data, tokenizer)
print(f"Dataset size: {len(dataset)}")

sample = dataset[0]
print(f"input_ids: {sample['input_ids'].shape}")
print(f"labels: {sample['labels'].shape}")
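
Because every item is already padded to model_max_length, batching needs no custom collator. The FastChat training script hands these datasets to the Hugging Face Trainer; as a standalone sketch (an assumption about usage, not part of the source), transformers.default_data_collator simply stacks the fixed-length tensors:

from torch.utils.data import DataLoader
from transformers import default_data_collator

loader = DataLoader(dataset, batch_size=2, collate_fn=default_data_collator)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # torch.Size([2, 2048])
print(batch["labels"].shape)     # torch.Size([2, 2048])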

Key implementation detail -- IGNORE_TOKEN_ID:

from transformers.trainer_pt_utils import LabelSmoother

# IGNORE_TOKEN_ID is defined as:
IGNORE_TOKEN_ID = LabelSmoother.ignore_index  # equals -100
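
The choice of -100 is not arbitrary: it matches the default ignore_index of PyTorch's cross-entropy loss, so masked positions contribute nothing to the training objective. A minimal demonstration:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # 4 positions, vocab size 10
labels = torch.tensor([3, -100, 7, -100])   # two positions masked

# cross_entropy ignores label -100 by default, so the mean is taken
# over the two unmasked positions only.
loss = F.cross_entropy(logits, labels)
print(loss)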
