Heuristic: IGNORE_INDEX Loss Masking
| Knowledge Sources | |
|---|---|
| Source | LLMBook-zh (llmbook-zh.github.io) |
| Domains | LLMs, Supervised_Finetuning |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
Use `IGNORE_INDEX = -100` to mask instruction/prompt tokens in SFT labels, ensuring loss is computed only on response tokens.
Description
In supervised fine-tuning (SFT), the training data contains both instruction (source) and response (target) tokens. The model should learn to generate responses, not to predict instruction tokens. By setting label values to -100 for all instruction positions, PyTorch's CrossEntropyLoss automatically ignores these positions during loss computation. This is a standard PyTorch convention where -100 is the default ignore_index parameter.
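A minimal sketch of this behavior (assuming PyTorch is installed): positions labeled -100 are excluded from `CrossEntropyLoss` exactly as if they were absent, so masking the instruction prefix gives the same loss as computing it on the response tokens alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(6, vocab_size)                  # 6 token positions
labels = torch.tensor([-100, -100, -100, 4, 7, 2])   # first 3 = instruction

loss_fn = nn.CrossEntropyLoss()                      # ignore_index defaults to -100
masked_loss = loss_fn(logits, labels)

# Equivalent: compute the loss only over the response positions.
response_loss = loss_fn(logits[3:], labels[3:])
assert torch.allclose(masked_loss, response_loss)
```

Note that with the default `reduction='mean'`, the loss is averaged over non-ignored positions only, so masking does not dilute the per-token signal.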
Usage
Use this heuristic whenever performing supervised fine-tuning on instruction-response pairs. It prevents the model from wasting capacity learning to predict the instruction prefix and focuses training exclusively on generating correct responses.
The Insight (Rule of Thumb)
- Action: Set all instruction/prompt token positions in the labels tensor to -100.
- Value: `IGNORE_INDEX = -100` (PyTorch default for `CrossEntropyLoss.ignore_index`).
- Trade-off: Negligible in practice. The one pitfall is computing the prefix length: tokenizing the source alone must produce the same prefix tokens as tokenizing source + target together, or the mask will be off by a token at the boundary.
- Implementation: Encode the source separately to determine its length, then mask `label[:len(source_id)] = IGNORE_INDEX`.
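The rule of thumb above can be sketched with plain Python lists (the token ids here are made up for illustration): copy the input ids into the labels, then overwrite the instruction prefix.

```python
IGNORE_INDEX = -100

source_ids = [101, 2054, 2003]   # instruction tokens (illustrative ids)
target_ids = [1996, 3437, 102]   # response tokens (illustrative ids)

input_ids = source_ids + target_ids
labels = input_ids.copy()
# Mask every instruction position; response positions keep their ids.
labels[:len(source_ids)] = [IGNORE_INDEX] * len(source_ids)
# labels is now [-100, -100, -100, 1996, 3437, 102]
```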
Reasoning
PyTorch's `CrossEntropyLoss` accepts an `ignore_index` parameter (default -100). Any label position set to this value is excluded from both the loss computation and the gradient. Without masking, the model would compute loss on instruction tokens, diluting the training signal and teaching the model to memorize prompts rather than generate responses. This technique is universally used in instruction-tuning frameworks (Alpaca, Vicuna, LLaMA-Factory).
Code Evidence:
IGNORE_INDEX constant from `code/7.2 SFT数据类.py:5` and `code/7.1 SFT实践.py:14`:
```python
IGNORE_INDEX = -100
```
Loss masking implementation from `code/7.2 SFT数据类.py:37-45`:
```python
def encode_src_tgt(self, s, t, tokenizer):
    # Encode the instruction alone to learn the prefix length (no EOS here).
    source_id = tokenizer.encode(s, max_length=tokenizer.model_max_length, truncation=True)
    tokenizer.add_eos_token = True
    input_id = tokenizer.encode(s + t, max_length=tokenizer.model_max_length,
                                truncation=True, return_tensors='pt')[0]
    tokenizer.add_eos_token = False
    label = input_id.clone()
    # Mask the instruction prefix so loss is computed only on the response.
    label[:len(source_id)] = self.IGNORE_INDEX
    return input_id, label
```
Padding with IGNORE_INDEX in collator from `code/7.1 SFT实践.py:57`:
```python
labels = torch.nn.utils.rnn.pad_sequence(
    labels, batch_first=True, padding_value=IGNORE_INDEX
)
```