
Heuristic: IGNORE_INDEX Loss Masking (LLMBook-zh/LLMBook-zh.github.io)

From Leeroopedia




Knowledge Sources
Domains: LLMs, Supervised_Finetuning
Last Updated: 2026-02-08 04:30 GMT

Overview

Use IGNORE_INDEX = -100 to mask instruction/prompt tokens in SFT labels, ensuring loss is computed only on response tokens.

Description

In supervised fine-tuning (SFT), the training data contains both instruction (source) and response (target) tokens. The model should learn to generate responses, not to predict instruction tokens. By setting the label value to -100 at every instruction position, PyTorch's `CrossEntropyLoss` automatically excludes those positions from the loss computation. This is a standard PyTorch convention: -100 is the default value of the `ignore_index` parameter.
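To make the convention concrete, here is a minimal pure-Python sketch of the semantics of `ignore_index=-100`: positions whose label equals the ignore value simply do not participate in the mean cross-entropy. The function name and the toy log-probabilities are illustrative, not from the source repository.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(log_probs, labels):
    """Mean negative log-likelihood over positions whose label != IGNORE_INDEX,
    mirroring the behavior of torch.nn.CrossEntropyLoss(ignore_index=-100)."""
    losses = [-lp[y] for lp, y in zip(log_probs, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)

# Two positions: the first (an instruction token) is masked, so only the
# second (a response token with label 1) contributes to the loss.
log_probs = [
    [math.log(0.5), math.log(0.5)],    # masked instruction position
    [math.log(0.25), math.log(0.75)],  # response position, label 1
]
labels = [IGNORE_INDEX, 1]
loss = masked_cross_entropy(log_probs, labels)  # equals -log(0.75)
```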

Usage

Use this heuristic whenever performing supervised fine-tuning on instruction-response pairs. It prevents the model from wasting capacity learning to predict the instruction prefix and focuses training exclusively on generating correct responses.

The Insight (Rule of Thumb)

  • Action: Set all instruction/prompt token positions in the labels tensor to -100.
  • Value: `IGNORE_INDEX = -100` (PyTorch default for `CrossEntropyLoss.ignore_index`).
  • Trade-off: None significant. This is standard practice with no downside.
  • Implementation: Encode source separately to determine its length, then mask `label[:len(source_id)] = IGNORE_INDEX`.
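The masking step in the last bullet can be sketched with plain token-id lists (the ids and the helper name are hypothetical; the real implementation operates on tensors):

```python
IGNORE_INDEX = -100

def build_labels(source_ids, input_ids):
    """Copy input_ids, then overwrite the instruction prefix with IGNORE_INDEX
    so the loss is computed only on response tokens."""
    labels = list(input_ids)
    labels[:len(source_ids)] = [IGNORE_INDEX] * len(source_ids)
    return labels

source_ids = [1, 42, 17]             # instruction tokens (hypothetical ids)
input_ids = [1, 42, 17, 99, 5, 2]    # instruction + response + EOS
labels = build_labels(source_ids, input_ids)
# labels == [-100, -100, -100, 99, 5, 2]
```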

Reasoning

PyTorch's `CrossEntropyLoss` accepts an `ignore_index` parameter (default -100). Any label position set to this value is excluded from both the loss computation and the gradient. Without masking, the model would compute loss on instruction tokens, diluting the training signal and teaching the model to memorize prompts rather than generate responses. This technique is universally used in instruction-tuning frameworks (Alpaca, Vicuna, LLaMA-Factory).
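The exclusion can be verified directly in PyTorch: the loss over a sequence with masked positions equals the loss computed over only the unmasked positions. This is a quick sanity check, not code from the source repository.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)                   # 5 token positions, vocab of 10
labels = torch.tensor([-100, -100, 3, 7, 1])  # first two positions masked

# Loss with ignore_index (-100 is already the default) ...
masked_loss = F.cross_entropy(logits, labels, ignore_index=-100)
# ... equals the loss over only the unmasked positions.
unmasked_only = F.cross_entropy(logits[2:], labels[2:])
```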

Code Evidence:

IGNORE_INDEX constant from `code/7.2 SFT数据类.py:5` and `code/7.1 SFT实践.py:14`:

IGNORE_INDEX = -100

Loss masking implementation from `code/7.2 SFT数据类.py:37-45`:

def encode_src_tgt(self, s, t, tokenizer):
    # Encode the source (instruction) alone to learn its token length.
    source_id = tokenizer.encode(s, max_length=tokenizer.model_max_length, truncation=True)
    # Append EOS only to the full sequence, not to the source prefix.
    tokenizer.add_eos_token = True
    input_id = tokenizer.encode(s + t, max_length=tokenizer.model_max_length, truncation=True,
                                return_tensors='pt')[0]
    tokenizer.add_eos_token = False
    # Labels start as a copy of the inputs; the instruction prefix is then
    # masked so loss is computed only on response (and EOS) tokens.
    label = input_id.clone()
    label[:len(source_id)] = self.IGNORE_INDEX
    return input_id, label

Padding with IGNORE_INDEX in collator from `code/7.1 SFT实践.py:57`:

labels = torch.nn.utils.rnn.pad_sequence(
    labels, batch_first=True, padding_value=IGNORE_INDEX
)
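Padding labels with IGNORE_INDEX serves the same purpose as masking the prefix: padded positions are excluded from the loss. A minimal pure-Python sketch of what `pad_sequence(padding_value=IGNORE_INDEX)` does (the helper name is illustrative):

```python
IGNORE_INDEX = -100

def pad_labels(label_seqs):
    """Right-pad variable-length label sequences with IGNORE_INDEX so that
    padded positions, like masked instruction positions, contribute no loss."""
    width = max(len(s) for s in label_seqs)
    return [s + [IGNORE_INDEX] * (width - len(s)) for s in label_seqs]

batch = pad_labels([[-100, 7, 2], [-100, -100, 5, 9, 2]])
# batch == [[-100, 7, 2, -100, -100], [-100, -100, 5, 9, 2]]
```

Note that `input_ids` are padded with the tokenizer's pad token instead; only the labels use IGNORE_INDEX.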
