Principle: lm-sys/FastChat Conversation Preprocessing
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Conversation Preprocessing |
| Repository | lm-sys/FastChat |
| Workflow | Vicuna SFT Finetuning |
| Domains | NLP Preprocessing, Tokenization, Loss Masking, Prompt Engineering |
| Knowledge Sources | fastchat/train/train.py, Vicuna conversation template, SeparatorStyle documentation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle covers the theory and techniques for preprocessing multi-turn conversations into a format suitable for supervised fine-tuning of causal language models. It addresses prompt template application, tokenization, and the critical practice of target masking -- ensuring that the training loss is computed only on the assistant's outputs, not on user inputs or special tokens.
Description
Prompt Template Application
Raw conversations in ShareGPT format must be converted into a single text string that the model can process. This is accomplished by applying a conversation template that defines:
- Role prefixes: Each speaker's turn is preceded by a role identifier. In the Vicuna template, the human role is prefixed with `"USER: "` and the assistant role with `"ASSISTANT: "`.
- System message: An optional system-level instruction that frames the assistant's behavior (e.g., "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.").
- Separator style: The Vicuna template uses `SeparatorStyle.ADD_COLON_TWO`, which means two different separators are used -- one between turns within a round (`sep`, typically a space) and one terminating each conversation round (`sep2`, typically the EOS token `"</s>"`).
- Turn structure: Each turn is formatted as `ROLE: content`, and turns are concatenated with the appropriate separators to form the full prompt string.
The resulting prompt string looks like:
A chat between a curious user and an artificial intelligence assistant...
USER: What is the capital of France? ASSISTANT: The capital of France is Paris.</s>USER: What is its population? ASSISTANT: Paris has a population of approximately 2.1 million...</s>
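The assembly of this prompt string can be sketched in a few lines. The helper name `build_vicuna_prompt` and the constants below are illustrative, not FastChat's API; the separator values match the Vicuna template described above.

```python
# Minimal sketch of Vicuna-style prompt assembly (ADD_COLON_TWO).
# `build_vicuna_prompt` is a hypothetical helper, not FastChat's API.
SYSTEM = ("A chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite "
          "answers to the user's questions.")
SEP = " "      # separator within a round
SEP2 = "</s>"  # EOS separator terminating each round

def build_vicuna_prompt(turns):
    """turns: list of (role, content) pairs with roles 'USER' / 'ASSISTANT'."""
    out = SYSTEM + SEP
    seps = [SEP, SEP2]  # alternate: space after USER, </s> after ASSISTANT
    for i, (role, content) in enumerate(turns):
        out += f"{role}: {content}{seps[i % 2]}"
    return out

prompt = build_vicuna_prompt([
    ("USER", "What is the capital of France?"),
    ("ASSISTANT", "The capital of France is Paris."),
])
```

Each round thus ends in `</s>`, which is exactly the boundary the masking logic later splits on.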
Tokenization
After template application, the prompt string is tokenized:
- The tokenizer converts the text into a sequence of input IDs (integer token indices).
- Sequences are padded to `model_max_length` using the pad token.
- Sequences exceeding `model_max_length` are truncated.
- The tokenizer returns `input_ids` as a PyTorch tensor.
- Targets (labels) are initialized as a clone of `input_ids`, then selectively masked.
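A toy illustration of the padding, truncation, and label-cloning steps follows; a whitespace "tokenizer" stands in for the real Hugging Face tokenizer, and all names here are illustrative rather than FastChat's API.

```python
# Toy illustration of truncation, padding, and label cloning.
# A whitespace "tokenizer" stands in for the real HF tokenizer.
MODEL_MAX_LENGTH = 8
PAD_ID = 0

def tokenize(text, vocab):
    ids = [vocab.setdefault(w, len(vocab) + 1) for w in text.split()]
    ids = ids[:MODEL_MAX_LENGTH]                     # truncate
    ids += [PAD_ID] * (MODEL_MAX_LENGTH - len(ids))  # pad to model_max_length
    return ids

vocab = {}
input_ids = tokenize("USER: Hi ASSISTANT: Hello", vocab)
# labels start as an exact clone of input_ids; masking happens afterwards
labels = list(input_ids)
```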
Target Masking
Target masking is the most critical aspect of conversation preprocessing for SFT. The principle is:
"Only compute the training loss on tokens that the model should learn to generate."
In practice, this means:
- User turns are masked: All tokens corresponding to user inputs (the `"USER: ..."` portions) are replaced with `IGNORE_TOKEN_ID = -100` in the labels tensor. The value -100 is the standard PyTorch cross-entropy ignore index, meaning these positions are excluded from the loss computation.
- Assistant turns are unmasked: Tokens in the assistant's responses remain in the labels tensor, so the model learns to predict these tokens.
- Special tokens and padding are masked: BOS (beginning-of-sequence) tokens, padding tokens, and any structural tokens that are not part of the assistant's output are also set to `IGNORE_TOKEN_ID`.
- System prompt is masked: The initial system message and the first user turn are masked, as the model should not be trained to generate these.
The masking process works by:
- Splitting the tokenized conversation at `sep2` (the end-of-round separator, `"</s>"`) to identify individual rounds.
- Within each round, splitting at the separator between the role prefix and the response content.
- Computing the token length of the user instruction portion and masking those token positions in the labels.
- Advancing through the sequence turn by turn, accumulating the current position offset.
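The steps above can be sketched in simplified form. `mask_targets` and the whitespace-based `tok_len` are illustrative stand-ins: the real code in train.py counts tokens with the model's tokenizer, builds the within-round separator from the template's role names, and applies tokenizer-specific offset corrections that are omitted here.

```python
# Simplified sketch of turn-by-turn target masking. `tok_len` is a
# whitespace stand-in for real token counting; offset corrections
# (e.g., the Llama "-2" adjustment) are omitted.
IGNORE_TOKEN_ID = -100

def mask_targets(conversation, labels, tok_len, sep, sep2):
    """Mask everything except assistant responses; labels is edited in place."""
    labels[0] = IGNORE_TOKEN_ID           # position 0: BOS, structural token
    cur = 1                               # current offset past BOS
    for rou in conversation.split(sep2):  # one round per sep2 ("</s>")
        if not rou:
            break
        parts = rou.split(sep)            # [user instruction, assistant reply]
        if len(parts) != 2:
            break
        instr_len = tok_len(parts[0] + sep)   # user side incl. role prefix
        for i in range(cur, cur + instr_len):
            labels[i] = IGNORE_TOKEN_ID       # mask user-instruction tokens
        cur += tok_len(rou + sep2)        # advance past the whole round
    return labels

tok_len = lambda s: len(s.split())        # toy token counter
conv = "USER: Hi ASSISTANT: Hello</s>"
# one BOS token followed by 4 whitespace tokens
labels = mask_targets(conv, [0, 1, 2, 3, 4], tok_len, " ASSISTANT: ", "</s>")
# only the assistant's "Hello</s>" token survives as a target
```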
Handling Edge Cases
The preprocessing must handle several edge cases:
- Non-human first turn: If the first turn in a conversation is not from the human role, it is skipped to ensure proper alternation.
- Tokenization mismatches: Due to differences in how tokenizers handle special tokens (legacy vs. non-legacy modes), offsets may need adjustment. The code includes hardcoded corrections (e.g., "-2" for the Llama tokenizer).
- Truncated conversations: If a conversation is longer than `model_max_length`, the remaining tokens after the last complete turn are masked entirely.
- Mismatch detection: If the computed current position does not match the total non-padding length, the entire sample's labels are set to `IGNORE_TOKEN_ID` (effectively skipping the sample), and a warning is printed.
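The mismatch guard can be sketched as below; `check_alignment` is an illustrative name for logic that appears inline in train.py.

```python
# Sketch of the final alignment guard: if the accumulated offset disagrees
# with the non-padding length, mask the whole sample so it contributes no
# loss, and emit a warning. Names are illustrative.
IGNORE_TOKEN_ID = -100
PAD_ID = 0

def check_alignment(labels, input_ids, cur):
    non_pad = sum(1 for t in input_ids if t != PAD_ID)
    if cur != non_pad:
        labels[:] = [IGNORE_TOKEN_ID] * len(labels)  # skip the whole sample
        print(f"WARNING: tokenization mismatch: {cur} vs. {non_pad}")
    return labels

ok = check_alignment([9, 9, 9, -100], [5, 6, 7, 0], cur=3)   # aligned: kept
bad = check_alignment([9, 9, 9, -100], [5, 6, 7, 0], cur=2)  # mismatch: dropped
```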
BOS and EOS Tokens
- The BOS (beginning-of-sequence) token is typically the first token in the sequence. It is masked in the labels (position 0 is set to `IGNORE_TOKEN_ID`) because it is a structural token, not a generation target.
- The EOS (end-of-sequence) token appears as part of `sep2` at the end of each assistant turn. It is not masked, so the model learns to produce an end-of-turn signal.
Usage
When implementing conversation preprocessing for SFT:
- Select the appropriate conversation template (e.g., "vicuna") using `get_conversation_template`.
- Map the raw conversation roles ("human", "gpt") to the template's role names.
- Apply the template to generate a single prompt string per conversation.
- Tokenize all conversations with padding and truncation.
- Clone `input_ids` to create the labels tensor.
- Mask all non-assistant tokens in the labels with `IGNORE_TOKEN_ID` (-100).
- Verify alignment between computed and actual sequence lengths.
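The role-mapping step (including the non-human-first-turn edge case from above) might look like the following sketch; `ROLE_MAP` and `map_roles` are illustrative names, not FastChat's API.

```python
# Sketch of mapping ShareGPT roles ("human"/"gpt") onto the Vicuna
# template's role names; a leading non-human turn is skipped so that
# rounds alternate USER/ASSISTANT. Names are illustrative.
ROLE_MAP = {"human": "USER", "gpt": "ASSISTANT"}

def map_roles(sharegpt_turns):
    """sharegpt_turns: list of {'from': ..., 'value': ...} dicts."""
    if sharegpt_turns and sharegpt_turns[0]["from"] != "human":
        sharegpt_turns = sharegpt_turns[1:]   # drop non-human first turn
    return [(ROLE_MAP[t["from"]], t["value"]) for t in sharegpt_turns]

turns = map_roles([
    {"from": "gpt", "value": "stray greeting"},   # skipped
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "4."},
])
```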
Theoretical Basis
Target masking for SFT is grounded in the principle of credit assignment: during training, the gradient signal should only flow through predictions that the model is responsible for generating. By masking user inputs:
- The model is not penalized for "failing to predict" the user's queries, which are external inputs it has no control over.
- The model's parameter updates are concentrated on improving the quality of its own responses.
- The effective training signal is cleaner, leading to faster convergence and better response quality.
This approach is analogous to teacher forcing in sequence-to-sequence models, where the decoder is trained to predict the target sequence given the source sequence, but the loss is only computed on the target side.
The use of IGNORE_TOKEN_ID = -100 leverages PyTorch's built-in support for ignore indices in CrossEntropyLoss, ensuring that masked positions contribute zero to both the loss and its gradient.
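The ignore-index semantics can be made concrete with a hand-rolled cross-entropy over toy logits; this mirrors the behavior of `torch.nn.CrossEntropyLoss(ignore_index=-100)` with mean reduction, without depending on PyTorch.

```python
import math

# Hand-rolled token-level cross-entropy mirroring PyTorch's ignore_index
# semantics: positions labeled -100 are excluded from both the summed
# loss and the averaging denominator.
IGNORE_TOKEN_ID = -100

def masked_cross_entropy(logits, labels):
    """logits: per-position lists of class logits; labels: token ids or -100."""
    total, count = 0.0, 0
    for row, y in zip(logits, labels):
        if y == IGNORE_TOKEN_ID:
            continue                 # masked: zero loss, zero gradient
        log_z = math.log(sum(math.exp(x) for x in row))
        total += log_z - row[y]      # -log softmax(row)[y]
        count += 1
    return total / count if count else 0.0

# Uniform two-class logits: each unmasked position costs log(2);
# the masked position contributes nothing.
loss = masked_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [IGNORE_TOKEN_ID, 1])
```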