Principle: Loss Masking Tokenization
| Knowledge Sources | LLMBook-zh (LLMBook-zh.github.io) |
|---|---|
| Domains | NLP, Training |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A selective loss computation technique that masks prompt tokens during SFT so the model only learns to generate the response portion.
Description
Loss Masking Tokenization addresses the problem of training signal dilution in supervised fine-tuning. When training on instruction-response pairs, computing the loss over the entire sequence (including the prompt) spends part of the training signal on predicting the instruction text, which the model is never asked to generate at inference time. By setting the labels for prompt tokens to IGNORE_INDEX (-100), the cross-entropy loss ignores these positions, so learning focuses entirely on generating the correct response.
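A minimal sketch of the mechanism in plain PyTorch (toy tensors, no tokenizer): positions whose label is -100 are simply skipped by the loss.

```python
import torch
import torch.nn.functional as F

# Toy logits for 6 sequence positions over a 10-token vocabulary.
logits = torch.randn(6, 10)
# The first three positions play the role of prompt tokens and are masked with -100.
labels = torch.tensor([-100, -100, -100, 4, 7, 2])

# -100 is PyTorch's default ignore_index, so masked positions contribute
# neither to the loss value nor to the gradients.
loss = F.cross_entropy(logits, labels)

# Equivalent by hand: average the per-token loss over unmasked positions only.
per_token = F.cross_entropy(logits, labels.clamp(min=0), reduction="none")
assert torch.allclose(loss, per_token[labels != -100].mean())
```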
Usage
Use this principle whenever performing supervised fine-tuning with instruction-response data. Masking the instruction portion of the labels is standard practice in instruction-tuning pipelines.
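For illustration, a single masked training example might look like this (the token IDs are made up; the first six positions are the prompt, the last four the response):

```python
input_ids = [11, 12, 13, 14, 15, 16, 21, 22, 23, 2]
labels    = [-100, -100, -100, -100, -100, -100, 21, 22, 23, 2]
```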
Theoretical Basis
Given a concatenated sequence [prompt, response]:
- Tokenize the full sequence to get input_ids.
- Create labels as a copy of input_ids.
- Set labels[:len(prompt_tokens)] = IGNORE_INDEX (-100).
- The cross-entropy loss function in PyTorch ignores positions with label -100.
Sketch of the algorithm (assuming a Hugging Face-style tokenizer; `tokenizer`, `prompt`, and `response` are defined elsewhere):

```python
import torch

source_ids = tokenizer.encode(prompt, add_special_tokens=False)
full_ids = tokenizer.encode(prompt + response, add_special_tokens=False)
full_ids.append(tokenizer.eos_token_id)    # append EOS so the model learns to stop

input_ids = torch.tensor(full_ids, dtype=torch.long)
labels = input_ids.clone()
labels[: len(source_ids)] = -100           # mask prompt tokens, as in the steps above
# Loss is only computed on response tokens
```
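Batched training also needs padding. A minimal collator sketch (the function name and `IGNORE_INDEX` constant are illustrative, not from LLMBook-zh; each example is assumed to be a dict holding the `input_ids` and `labels` tensors built as above) applies the same ignore value to padded label positions so they are likewise excluded from the loss:

```python
from typing import Dict, List

import torch

IGNORE_INDEX = -100

def collate_masked_examples(batch: List[Dict[str, torch.Tensor]],
                            pad_token_id: int) -> Dict[str, torch.Tensor]:
    """Right-pad a batch of examples; padded label positions also get IGNORE_INDEX."""
    max_len = max(ex["input_ids"].size(0) for ex in batch)
    input_ids, labels, attention_mask = [], [], []
    for ex in batch:
        pad = max_len - ex["input_ids"].size(0)
        input_ids.append(torch.cat(
            [ex["input_ids"], torch.full((pad,), pad_token_id, dtype=torch.long)]))
        labels.append(torch.cat(
            [ex["labels"], torch.full((pad,), IGNORE_INDEX, dtype=torch.long)]))
        attention_mask.append(torch.cat(
            [torch.ones(ex["input_ids"].size(0), dtype=torch.long),
             torch.zeros(pad, dtype=torch.long)]))
    return {"input_ids": torch.stack(input_ids),
            "labels": torch.stack(labels),
            "attention_mask": torch.stack(attention_mask)}
```

Padding positions receive IGNORE_INDEX for the same reason prompt positions do: cross-entropy then skips them, so neither padding nor the instruction text contributes to the gradient.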