Implementation: lm-sys FastChat Preprocess Conversation
| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Title | Preprocess Conversation |
| Repository | lm-sys/FastChat |
| Workflow | Vicuna SFT Finetuning |
| Domains | NLP Preprocessing, Tokenization, Loss Masking |
| Knowledge Sources | fastchat/train/train.py |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This implementation documents the preprocess function and the associated SupervisedDataset and LazySupervisedDataset classes. Together, these components transform raw ShareGPT conversations into tokenized, target-masked tensors ready for supervised fine-tuning. The preprocess function handles prompt template application, tokenization, and target masking, while the dataset classes provide PyTorch Dataset interfaces for training.
Description
The preprocess Function
The preprocess function performs three major operations:
- **Prompt template application**: Uses `get_conversation_template("vicuna")` to obtain the Vicuna conversation template, maps raw roles (`"human"`, `"gpt"`) to template roles (`"USER"`, `"ASSISTANT"`), and generates formatted prompt strings.
- **Tokenization**: Tokenizes all conversations with the provided tokenizer using `max_length` padding and truncation, producing `input_ids` tensors. These are cloned as the initial `targets`.
- **Target masking**: Iterates through each conversation, splitting on `sep2` (the `</s>` separator in the Vicuna template) to identify turns. For each turn, the user instruction portion is masked with `IGNORE_TOKEN_ID` (-100). The BOS token at position 0 and padding tokens beyond the conversation content are masked as well.
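The masking step can be illustrated with a small, self-contained sketch. This is not FastChat's actual code: `mask_targets`, the toy token IDs, and the precomputed assistant spans are invented for illustration (the real code derives spans by splitting the decoded text on the separator):

```python
import torch

IGNORE_TOKEN_ID = -100  # equals LabelSmoother.ignore_index

def mask_targets(input_ids, assistant_spans, pad_token_id):
    """Clone input_ids, then overwrite everything outside the
    assistant-response spans (and all padding) with IGNORE_TOKEN_ID."""
    targets = input_ids.clone()
    keep = torch.zeros_like(targets, dtype=torch.bool)
    for start, end in assistant_spans:
        keep[start:end] = True
    targets[~keep] = IGNORE_TOKEN_ID
    targets[input_ids == pad_token_id] = IGNORE_TOKEN_ID
    return targets

# Toy sequence: [BOS, user, user, asst, asst, asst, PAD, PAD]
ids = torch.tensor([1, 10, 11, 20, 21, 22, 0, 0])
labels = mask_targets(ids, assistant_spans=[(3, 6)], pad_token_id=0)
print(labels.tolist())  # [-100, -100, -100, 20, 21, 22, -100, -100]
```

Only the assistant tokens survive in `labels`; everything else is excluded from the loss.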
SupervisedDataset
An eager dataset that preprocesses all data at initialization:
- Calls `preprocess` on all conversations at once during `__init__`.
- Stores `input_ids`, `labels`, and `attention_mask` tensors as instance attributes.
- `__getitem__` returns a dictionary of tensors for a single index.
- Memory-intensive at startup, but offers fast per-sample access.
LazySupervisedDataset
A lazy dataset that preprocesses data on demand:
- Stores raw data and tokenizer at initialization; does not tokenize.
- On `__getitem__`, checks a `cached_data_dict` for previously processed items.
- If not cached, calls `preprocess` for the single conversation, caches the result, and returns it.
- Lower startup cost; memory grows as samples are accessed.
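The caching behavior can be sketched in miniature. The `ToyLazyDataset` below is hypothetical, not the FastChat class (which caches the full preprocessed tensor dict per index); `str.upper` stands in for the real `preprocess` call:

```python
from torch.utils.data import Dataset

class ToyLazyDataset(Dataset):
    """Minimal sketch of the lazy pattern: preprocess an item on
    first access, then serve repeat accesses from a cache."""
    def __init__(self, raw_data, preprocess_fn):
        self.raw_data = raw_data            # kept unprocessed at init
        self.preprocess_fn = preprocess_fn
        self.cached_data_dict = {}
        self.calls = 0                      # counts preprocess invocations

    def __len__(self):
        return len(self.raw_data)

    def __getitem__(self, i):
        if i not in self.cached_data_dict:
            self.calls += 1
            self.cached_data_dict[i] = self.preprocess_fn(self.raw_data[i])
        return self.cached_data_dict[i]

ds = ToyLazyDataset(["a", "b"], preprocess_fn=str.upper)
ds[0]; ds[0]; ds[1]
print(ds.calls)  # 2 -- the repeated access to item 0 hit the cache
```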
Usage
Code Reference
Source Location
`fastchat/train/train.py`, lines 92-253:
- `preprocess`: lines 92-177
- `SupervisedDataset`: lines 180-202
- `LazySupervisedDataset`: lines 205-232
Signature
```python
def preprocess(
    sources,
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    ...

class SupervisedDataset(Dataset):
    def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer):
        ...

    def __len__(self) -> int:
        ...

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        ...

class LazySupervisedDataset(Dataset):
    def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer):
        ...

    def __len__(self) -> int:
        ...

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        ...
```
Import
```python
from fastchat.train.train import preprocess, SupervisedDataset, LazySupervisedDataset
```
I/O Contract
preprocess Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `sources` | `list[list[dict]]` | Yes | A list of conversations, where each conversation is a list of turn dictionaries with `"from"` (either `"human"` or `"gpt"`) and `"value"` (text content) keys. |
| `tokenizer` | `transformers.PreTrainedTokenizer` | Yes | A configured tokenizer with `model_max_length` and `pad_token` set. |
preprocess Outputs
| Key | Type | Description |
|---|---|---|
| `"input_ids"` | `torch.Tensor` (shape: `[batch, seq_len]`) | Tokenized input IDs, padded to `model_max_length`. |
| `"labels"` | `torch.Tensor` (shape: `[batch, seq_len]`) | Target labels with user turns masked as `IGNORE_TOKEN_ID` (-100). Only assistant outputs have valid token IDs. |
| `"attention_mask"` | `torch.Tensor` (shape: `[batch, seq_len]`) | Boolean mask where `True` indicates non-padding positions. |
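The -100 sentinel matters because PyTorch's cross-entropy loss skips positions whose label equals its `ignore_index` (which defaults to -100), so masked user tokens contribute nothing to the fine-tuning loss. A quick sketch with toy logits (not FastChat code):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a 4-token sequence over a 5-token vocabulary.
logits = torch.randn(4, 5)
labels = torch.tensor([-100, -100, 2, 3])  # user turn masked, assistant kept

# Positions with label -100 are excluded from the loss entirely.
loss = F.cross_entropy(logits, labels, ignore_index=-100)

# Equivalent: average the loss over only the unmasked positions.
manual = F.cross_entropy(logits[2:], labels[2:])
print(torch.allclose(loss, manual))  # True
```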
SupervisedDataset / LazySupervisedDataset Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `raw_data` | `list[dict]` | Yes | List of conversation dictionaries, each with a `"conversations"` key containing the list of turns. |
| `tokenizer` | `transformers.PreTrainedTokenizer` | Yes | A configured tokenizer instance. |
SupervisedDataset / LazySupervisedDataset Outputs (per item)
| Key | Type | Description |
|---|---|---|
| `"input_ids"` | `torch.Tensor` (shape: `[seq_len]`) | Tokenized input IDs for a single conversation. |
| `"labels"` | `torch.Tensor` (shape: `[seq_len]`) | Target labels with user turns masked. |
| `"attention_mask"` | `torch.Tensor` (shape: `[seq_len]`) | Boolean attention mask for a single conversation. |
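Because every item is already padded to `model_max_length`, the per-item tensors stack cleanly under PyTorch's default collate function with no custom collator. A minimal sketch with a stand-in dataset (the `FakeDataset` is hypothetical, using a fixed length of 8 in place of the real max length):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class FakeDataset(Dataset):
    """Stand-in producing items shaped like the datasets above."""
    def __len__(self):
        return 4

    def __getitem__(self, i):
        return {
            "input_ids": torch.full((8,), i, dtype=torch.long),
            "labels": torch.full((8,), -100, dtype=torch.long),
            "attention_mask": torch.ones(8, dtype=torch.bool),
        }

# Default collate stacks each key's tensors into a [batch, seq_len] tensor.
batch = next(iter(DataLoader(FakeDataset(), batch_size=2)))
print(batch["input_ids"].shape)  # torch.Size([2, 8])
```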
Usage Examples
Using the preprocess function directly:
```python
from fastchat.train.train import preprocess
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "lmsys/vicuna-7b-v1.5",
    model_max_length=2048,
    padding_side="right",
    use_fast=False,
)
tokenizer.pad_token = tokenizer.unk_token

# Single conversation with two turns
sources = [
    [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ]
]

result = preprocess(sources, tokenizer)
print(result["input_ids"].shape)  # torch.Size([1, 2048])
print(result["labels"].shape)     # torch.Size([1, 2048])

# Verify masking: user turn tokens should be -100
print((result["labels"][0] == -100).sum().item(), "tokens masked")
```
Using SupervisedDataset:
```python
from fastchat.train.train import SupervisedDataset

raw_data = [
    {
        "id": "conv_001",
        "conversations": [
            {"from": "human", "value": "Hello!"},
            {"from": "gpt", "value": "Hi there! How can I help you?"},
        ],
    },
    {
        "id": "conv_002",
        "conversations": [
            {"from": "human", "value": "Explain gravity."},
            {"from": "gpt", "value": "Gravity is a fundamental force..."},
        ],
    },
]

dataset = SupervisedDataset(raw_data, tokenizer)
print(f"Dataset size: {len(dataset)}")

sample = dataset[0]
print(f"input_ids: {sample['input_ids'].shape}")
print(f"labels: {sample['labels'].shape}")
```
Key implementation detail -- IGNORE_TOKEN_ID:
```python
from transformers.trainer_pt_utils import LabelSmoother

# IGNORE_TOKEN_ID is defined as:
IGNORE_TOKEN_ID = LabelSmoother.ignore_index  # equals -100
```