
Principle: Loss Masking Tokenization (LLMBook-zh)

From Leeroopedia


Knowledge Sources
Domains: NLP, Training
Last Updated: 2026-02-08 00:00 GMT

Overview

A selective loss-computation technique that masks prompt tokens during supervised fine-tuning (SFT) so the model only learns to generate the response portion.

Description

Loss Masking Tokenization addresses the problem of training-signal dilution in supervised fine-tuning. When training on instruction-response pairs, computing loss over the entire sequence (including the prompt) wastes capacity on learning to predict the instruction text itself. Setting the labels for prompt tokens to IGNORE_INDEX (-100) causes the cross-entropy loss to skip those positions, focusing learning entirely on generating the correct response.
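The effect of IGNORE_INDEX on the loss can be sketched in plain Python. This is a toy illustration, not a framework implementation: `masked_cross_entropy` and the dict-based `logprobs` representation are hypothetical names invented for this example, but the averaging behavior mirrors what a cross-entropy loss with `ignore_index=-100` does.

```python
import math

IGNORE_INDEX = -100  # same sentinel value as PyTorch's default ignore_index

def masked_cross_entropy(logprobs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    logprobs: per-position dicts mapping token id -> log-probability (toy format).
    labels:   per-position target token ids, with prompt positions set to IGNORE_INDEX.
    """
    losses = [-lp[y] for lp, y in zip(logprobs, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)

# Toy sequence of 4 positions: the first 2 are prompt tokens and get masked.
logprobs = [
    {0: math.log(0.9)},   # prompt position, excluded from the loss
    {1: math.log(0.8)},   # prompt position, excluded from the loss
    {2: math.log(0.5)},   # response position
    {3: math.log(0.25)},  # response position
]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 3]
loss = masked_cross_entropy(logprobs, labels)  # averages only the last two terms
```

Note that masked positions are excluded from both the sum and the denominator, so the prompt contributes nothing to the gradient signal.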

Usage

Use this principle whenever doing supervised fine-tuning on instruction-response data. Masking the instruction portion of the labels is standard practice in most instruction-tuning pipelines.

Theoretical Basis

Given a concatenated sequence [prompt, response]:

  1. Tokenize the full sequence to get input_ids.
  2. Create labels as a copy of input_ids.
  3. Set labels[:len(prompt_tokens)] = IGNORE_INDEX (-100).
  4. PyTorch's cross-entropy loss (whose default ignore_index is -100) skips positions with label -100.

Pseudo-code:

# Abstract algorithm (NOT a real implementation)
source_ids = tokenizer.encode(prompt)
full_ids = tokenizer.encode(prompt + response + eos)
labels = full_ids.clone()
labels[:len(source_ids)] = -100  # IGNORE_INDEX: mask prompt tokens
# Loss is then computed only on response tokens.
# Caveat: some tokenizers encode the prompt alone differently than as a
# prefix of the full string, so the masked length should be verified.
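The pseudo-code above can be made concrete without a real tokenizer by operating on pre-tokenized id lists, which also sidesteps the prompt/full-string boundary issue. This is a minimal sketch; `build_masked_labels` and the toy token ids are assumptions for illustration, not an API from any library.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy loss

def build_masked_labels(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response token ids; mask the prompt in the labels.

    Returns (input_ids, labels), where labels equals input_ids except that
    every prompt position is replaced with IGNORE_INDEX.
    """
    input_ids = prompt_ids + response_ids + [eos_id]
    labels = list(input_ids)
    labels[:len(prompt_ids)] = [IGNORE_INDEX] * len(prompt_ids)
    return input_ids, labels

# Toy ids: prompt = [5, 6, 7], response = [8, 9], eos = 2
input_ids, labels = build_masked_labels([5, 6, 7], [8, 9], 2)
# input_ids -> [5, 6, 7, 8, 9, 2]
# labels    -> [-100, -100, -100, 8, 9, 2]
```

Note that the EOS token is left unmasked, so the model also learns when to stop generating.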

Related Pages

Implemented By

Uses Heuristic
