Principle: lm-sys/FastChat Conversation Preprocessing
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Conversation Preprocessing |
| Repository | lm-sys/FastChat |
| Workflow | Vicuna SFT Finetuning |
| Domains | NLP Preprocessing, Tokenization, Loss Masking, Prompt Engineering |
| Knowledge Sources | fastchat/train/train.py, Vicuna conversation template, SeparatorStyle documentation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle covers the theory and techniques for preprocessing multi-turn conversations into a format suitable for supervised fine-tuning of causal language models. It addresses prompt template application, tokenization, and the critical practice of target masking -- ensuring that the training loss is computed only on the assistant's outputs, not on user inputs or special tokens.
Description
Prompt Template Application
Raw conversations in ShareGPT format must be converted into a single text string that the model can process. This is accomplished by applying a conversation template that defines:
- Role prefixes: Each speaker's turn is preceded by a role identifier. In the Vicuna template, the human role is prefixed with `"USER: "` and the assistant role with `"ASSISTANT: "`.
- System message: An optional system-level instruction that frames the assistant's behavior (e.g., "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.").
- Separator style: The Vicuna template uses `SeparatorStyle.ADD_COLON_TWO`, which means two different separators are used -- one between turns within a round (`sep`, typically a space) and one terminating each conversation round (`sep2`, typically the EOS token `"</s>"`).
- Turn structure: Each turn is formatted as `ROLE: content`, and turns are concatenated with the appropriate separators to form the full prompt string.
The resulting prompt string looks like:
A chat between a curious user and an artificial intelligence assistant...
USER: What is the capital of France? ASSISTANT: The capital of France is Paris.</s>USER: What is its population? ASSISTANT: Paris has a population of approximately 2.1 million...</s>
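The assembly of this prompt string can be sketched in a few lines. The helper name `build_vicuna_prompt` and the constants below are illustrative, not FastChat's API; the separator values match the Vicuna template described above.

```python
# Minimal sketch of Vicuna-style prompt assembly (ADD_COLON_TWO).
# `build_vicuna_prompt` is a hypothetical helper, not FastChat's API.
SYSTEM = ("A chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite "
          "answers to the user's questions.")
SEP = " "      # separator within a round
SEP2 = "</s>"  # EOS separator terminating each round

def build_vicuna_prompt(turns):
    """turns: list of (role, content) pairs with roles 'USER' / 'ASSISTANT'."""
    out = SYSTEM + SEP
    seps = [SEP, SEP2]  # alternate: space after USER, </s> after ASSISTANT
    for i, (role, content) in enumerate(turns):
        out += f"{role}: {content}{seps[i % 2]}"
    return out

prompt = build_vicuna_prompt([
    ("USER", "What is the capital of France?"),
    ("ASSISTANT", "The capital of France is Paris."),
])
```

Each round thus ends in `</s>`, which is exactly the boundary the masking logic later splits on.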
Tokenization
After template application, the prompt string is tokenized:
- The tokenizer converts the text into a sequence of input IDs (integer token indices).
- Sequences are padded to `model_max_length` using the pad token.
- Sequences exceeding `model_max_length` are truncated.
- The tokenizer returns `input_ids` as a PyTorch tensor.
- Targets (labels) are initialized as a clone of `input_ids`, then selectively masked.
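A toy illustration of the padding, truncation, and label-cloning steps follows; a whitespace "tokenizer" stands in for the real Hugging Face tokenizer, and all names here are illustrative rather than FastChat's API.

```python
# Toy illustration of truncation, padding, and label cloning.
# A whitespace "tokenizer" stands in for the real HF tokenizer.
MODEL_MAX_LENGTH = 8
PAD_ID = 0

def tokenize(text, vocab):
    ids = [vocab.setdefault(w, len(vocab) + 1) for w in text.split()]
    ids = ids[:MODEL_MAX_LENGTH]                     # truncate
    ids += [PAD_ID] * (MODEL_MAX_LENGTH - len(ids))  # pad to model_max_length
    return ids

vocab = {}
input_ids = tokenize("USER: Hi ASSISTANT: Hello", vocab)
# labels start as an exact clone of input_ids; masking happens afterwards
labels = list(input_ids)
```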
Target Masking
Target masking is the most critical aspect of conversation preprocessing for SFT. The principle is:
"Only compute the training loss on tokens that the model should learn to generate."
In practice, this means:
- User turns are masked: All tokens corresponding to user inputs (the `"USER: ..."` portions) are replaced with `IGNORE_TOKEN_ID = -100` in the labels tensor. The value -100 is the standard PyTorch cross-entropy ignore index, meaning these positions are excluded from the loss computation.
- Assistant turns are unmasked: Tokens in the assistant's responses remain in the labels tensor, so the model learns to predict these tokens.
- Special tokens and padding are masked: BOS (beginning-of-sequence) tokens, padding tokens, and any structural tokens that are not part of the assistant's output are also set to `IGNORE_TOKEN_ID`.
- System prompt is masked: The initial system message and the first user turn are masked, as the model should not be trained to generate these.
The masking process works by:
- Splitting the tokenized conversation at `sep2` (the end-of-round separator, `"</s>"`) to identify individual rounds.
- Within each round, splitting at the separator between the role prefix and the response content.
- Computing the token length of the user instruction portion and masking those token positions in the labels.
- Advancing through the sequence turn by turn, accumulating the current position offset.
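The steps above can be sketched in simplified form. `mask_targets` and the whitespace-based `tok_len` are illustrative stand-ins: the real code in train.py counts tokens with the model's tokenizer, builds the within-round separator from the template's role names, and applies tokenizer-specific offset corrections that are omitted here.

```python
# Simplified sketch of turn-by-turn target masking. `tok_len` is a
# whitespace stand-in for real token counting; offset corrections
# (e.g., the Llama "-2" adjustment) are omitted.
IGNORE_TOKEN_ID = -100

def mask_targets(conversation, labels, tok_len, sep, sep2):
    """Mask everything except assistant responses; labels is edited in place."""
    labels[0] = IGNORE_TOKEN_ID           # position 0: BOS, structural token
    cur = 1                               # current offset past BOS
    for rou in conversation.split(sep2):  # one round per sep2 ("</s>")
        if not rou:
            break
        parts = rou.split(sep)            # [user instruction, assistant reply]
        if len(parts) != 2:
            break
        instr_len = tok_len(parts[0] + sep)   # user side incl. role prefix
        for i in range(cur, cur + instr_len):
            labels[i] = IGNORE_TOKEN_ID       # mask user-instruction tokens
        cur += tok_len(rou + sep2)        # advance past the whole round
    return labels

tok_len = lambda s: len(s.split())        # toy token counter
conv = "USER: Hi ASSISTANT: Hello</s>"
# one BOS token followed by 4 whitespace tokens
labels = mask_targets(conv, [0, 1, 2, 3, 4], tok_len, " ASSISTANT: ", "</s>")
# only the assistant's "Hello</s>" token survives as a target
```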
Handling Edge Cases
The preprocessing must handle several edge cases:
- Non-human first turn: If the first turn in a conversation is not from the human role, it is skipped to ensure proper alternation.
- Tokenization mismatches: Due to differences in how tokenizers handle special tokens (legacy vs. non-legacy modes), offsets may need adjustment. The code includes hardcoded corrections (e.g., "-2" for the Llama tokenizer).
- Truncated conversations: If a conversation is longer than `model_max_length`, the remaining tokens after the last complete turn are masked entirely.
- Mismatch detection: If the computed current position does not match the total non-padding length, the entire sample's labels are set to `IGNORE_TOKEN_ID` (effectively skipping the sample), and a warning is printed.
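The mismatch guard can be sketched as below; `check_alignment` is an illustrative name for logic that appears inline in train.py.

```python
# Sketch of the final alignment guard: if the accumulated offset disagrees
# with the non-padding length, mask the whole sample so it contributes no
# loss, and emit a warning. Names are illustrative.
IGNORE_TOKEN_ID = -100
PAD_ID = 0

def check_alignment(labels, input_ids, cur):
    non_pad = sum(1 for t in input_ids if t != PAD_ID)
    if cur != non_pad:
        labels[:] = [IGNORE_TOKEN_ID] * len(labels)  # skip the whole sample
        print(f"WARNING: tokenization mismatch: {cur} vs. {non_pad}")
    return labels

ok = check_alignment([9, 9, 9, -100], [5, 6, 7, 0], cur=3)   # aligned: kept
bad = check_alignment([9, 9, 9, -100], [5, 6, 7, 0], cur=2)  # mismatch: dropped
```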
BOS and EOS Tokens
- The BOS (beginning-of-sequence) token is typically the first token in the sequence. It is masked in the labels (position 0 is set to `IGNORE_TOKEN_ID`) because it is a structural token, not a generation target.
- The EOS (end-of-sequence) token appears as part of `sep2` at the end of each assistant turn. It is not masked, so the model learns to produce an end-of-turn signal.
Usage
When implementing conversation preprocessing for SFT:
- Select the appropriate conversation template (e.g., "vicuna") using `get_conversation_template`.
- Map the raw conversation roles ("human", "gpt") to the template's role names.
- Apply the template to generate a single prompt string per conversation.
- Tokenize all conversations with padding and truncation.
- Clone `input_ids` to create the labels tensor.
- Mask all non-assistant tokens in the labels with `IGNORE_TOKEN_ID` (-100).
- Verify alignment between computed and actual sequence lengths.
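The role-mapping step (including the non-human-first-turn edge case from above) might look like the following sketch; `ROLE_MAP` and `map_roles` are illustrative names, not FastChat's API.

```python
# Sketch of mapping ShareGPT roles ("human"/"gpt") onto the Vicuna
# template's role names; a leading non-human turn is skipped so that
# rounds alternate USER/ASSISTANT. Names are illustrative.
ROLE_MAP = {"human": "USER", "gpt": "ASSISTANT"}

def map_roles(sharegpt_turns):
    """sharegpt_turns: list of {'from': ..., 'value': ...} dicts."""
    if sharegpt_turns and sharegpt_turns[0]["from"] != "human":
        sharegpt_turns = sharegpt_turns[1:]   # drop non-human first turn
    return [(ROLE_MAP[t["from"]], t["value"]) for t in sharegpt_turns]

turns = map_roles([
    {"from": "gpt", "value": "stray greeting"},   # skipped
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "4."},
])
```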
Theoretical Basis
Target masking for SFT is grounded in the principle of credit assignment: during training, the gradient signal should only flow through predictions that the model is responsible for generating. By masking user inputs:
- The model is not penalized for "failing to predict" the user's queries, which are external inputs it has no control over.
- The model's parameter updates are concentrated on improving the quality of its own responses.
- The effective training signal is cleaner, leading to faster convergence and better response quality.
This approach is analogous to teacher forcing in sequence-to-sequence models, where the decoder is trained to predict the target sequence given the source sequence, but the loss is only computed on the target side.
The use of IGNORE_TOKEN_ID = -100 leverages PyTorch's built-in support for ignore indices in CrossEntropyLoss, ensuring that masked positions contribute zero to both the loss and its gradient.
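The ignore-index semantics can be made concrete with a hand-rolled cross-entropy over toy logits; this mirrors the behavior of `torch.nn.CrossEntropyLoss(ignore_index=-100)` with mean reduction, without depending on PyTorch.

```python
import math

# Hand-rolled token-level cross-entropy mirroring PyTorch's ignore_index
# semantics: positions labeled -100 are excluded from both the summed
# loss and the averaging denominator.
IGNORE_TOKEN_ID = -100

def masked_cross_entropy(logits, labels):
    """logits: per-position lists of class logits; labels: token ids or -100."""
    total, count = 0.0, 0
    for row, y in zip(logits, labels):
        if y == IGNORE_TOKEN_ID:
            continue                 # masked: zero loss, zero gradient
        log_z = math.log(sum(math.exp(x) for x in row))
        total += log_z - row[y]      # -log softmax(row)[y]
        count += 1
    return total / count if count else 0.0

# Uniform two-class logits: each unmasked position costs log(2);
# the masked position contributes nothing.
loss = masked_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [IGNORE_TOKEN_ID, 1])
```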