
Implementation: LLMBook-zh SFTDataset.encode_src_tgt

From Leeroopedia


Knowledge Sources
Domains: NLP, Training
Last Updated: 2026-02-08 00:00 GMT

Overview

A concrete tool from the LLMBook repository for tokenizing instruction-response pairs with loss masking applied to the prompt tokens.

Description

The SFTDataset.encode_src_tgt method tokenizes a source (prompt) and target (response) pair into a single sequence, then creates a label tensor where the prompt portion is masked with IGNORE_INDEX (-100). This ensures cross-entropy loss is only computed on the response tokens.
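The exact implementation lives in the repository file cited below; what follows is a minimal sketch of the likely logic, assuming a HuggingFace-style tokenizer exposing `encode` and `eos_token_id`. The real code may handle the prompt/response boundary differently, since tokenizing the prompt alone can split tokens differently than tokenizing prompt plus response together:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from cross-entropy loss


def encode_src_tgt(s, t, tokenizer):
    # Tokenize the prompt alone to find how many leading tokens to mask.
    source_id = tokenizer.encode(s, add_special_tokens=False)
    # Tokenize prompt + response as one sequence and append EOS.
    ids = tokenizer.encode(s + t, add_special_tokens=False) + [tokenizer.eos_token_id]
    input_id = torch.tensor(ids, dtype=torch.long)
    label = input_id.clone()
    label[: len(source_id)] = IGNORE_INDEX  # mask the prompt portion
    return input_id, label
```

Because `label` is a clone of `input_id`, the response tokens and the EOS keep their original IDs and remain the only loss targets.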

Usage

This method is called internally by SFTDataset.process() for each training example after template formatting.

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/7.2 SFT数据类.py ("SFT data class")
  • Lines: 37-45

Signature

def encode_src_tgt(self, s: str, t: str, tokenizer: PreTrainedTokenizer) -> tuple:
    """
    Tokenizes source+target and creates masked labels.

    Args:
        s: Formatted source/prompt string.
        t: Target/response string.
        tokenizer: HuggingFace tokenizer.

    Returns:
        Tuple of (input_id: Tensor, label: Tensor).
        label has source tokens masked with IGNORE_INDEX = -100.
    """

Import

from dataset.sft_dataset import SFTDataset

I/O Contract

Inputs

Name | Type | Required | Description
s | str | Yes | Formatted prompt string (from template formatting)
t | str | Yes | Response/target string
tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer

Outputs

Name | Type | Description
input_id | Tensor | Full tokenized sequence [prompt + response + EOS]
label | Tensor | Copy of input_id with prompt tokens set to -100

Usage Examples

from dataset.sft_dataset import SFTDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Instantiate without running __init__ so the method can be called directly.
dataset = SFTDataset.__new__(SFTDataset)
dataset.IGNORE_INDEX = -100

prompt = "### Instruction:\nTell me a joke.\n### Output:\n"
response = "Why did the chicken cross the road?"

input_id, label = dataset.encode_src_tgt(prompt, response, tokenizer)
# The prompt portion of label is IGNORE_INDEX (-100), so no loss is computed there;
# the response tokens (plus EOS) keep their IDs and are the only loss targets.
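The -100 sentinel is not arbitrary: it is the default `ignore_index` of PyTorch's `torch.nn.CrossEntropyLoss`, so masked prompt positions drop out of the loss with no extra code. A quick check that loss over the full sequence equals loss over only the unmasked positions:

```python
import torch

logits = torch.randn(5, 10)                  # 5 positions, vocab size 10
labels = torch.tensor([-100, -100, 3, 4, 0])  # first two positions masked

loss_fn = torch.nn.CrossEntropyLoss()  # ignore_index defaults to -100
full = loss_fn(logits, labels)
partial = loss_fn(logits[2:], labels[2:])  # only the unmasked positions
assert torch.allclose(full, partial)
```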

Related Pages

Implements Principle

Uses Heuristic
