Implementation: SFTDataset.encode_src_tgt
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete utility from the LLMBook repository for tokenizing instruction-response pairs, with loss masking applied to the prompt tokens.
Description
The SFTDataset.encode_src_tgt method tokenizes a source (prompt) and target (response) pair into a single sequence, then creates a label tensor where the prompt portion is masked with IGNORE_INDEX (-100). This ensures cross-entropy loss is only computed on the response tokens.
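The masking step can be sketched in isolation. This is a minimal re-creation of the described behavior, not the repository's code: a toy tokenization is assumed (plain token-ID lists plus an EOS ID), and only the concatenate-then-mask logic is shown.

```python
import torch

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default

def encode_src_tgt_sketch(src_ids: list, tgt_ids: list, eos_id: int):
    """Hypothetical sketch: concatenate prompt and response token IDs,
    append EOS, then mask the prompt portion of the label tensor."""
    input_id = torch.tensor(src_ids + tgt_ids + [eos_id])
    label = input_id.clone()
    label[:len(src_ids)] = IGNORE_INDEX  # loss is never computed on the prompt
    return input_id, label

input_id, label = encode_src_tgt_sketch([1, 2, 3], [4, 5], eos_id=0)
# label -> tensor([-100, -100, -100, 4, 5, 0])
```

The key property is that `input_id` and `label` have identical length, so the label tensor lines up position-for-position with the model's logits.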
Usage
This method is called internally by SFTDataset.process() for each training example after template formatting.
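The surrounding calling pattern might look like the sketch below. The template string and `format_example` helper are assumptions for illustration, not the repository's actual code.

```python
# Hypothetical sketch of the per-example flow inside process():
# template-format the raw record, then hand (src, tgt) to encode_src_tgt.
TEMPLATE = "### Instruction:\n{instruction}\n### Output:\n"  # assumed template

def format_example(example: dict) -> tuple:
    """Split a raw record into (formatted prompt, response) strings."""
    src = TEMPLATE.format(instruction=example["instruction"])
    tgt = example["output"]
    return src, tgt

src, tgt = format_example({"instruction": "Tell me a joke.",
                           "output": "Why did the chicken cross the road?"})
# src, tgt would then be passed to encode_src_tgt along with the tokenizer
```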
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/7.2 SFT数据类.py
- Lines: 37-45
Signature
```python
def encode_src_tgt(self, s: str, t: str, tokenizer: PreTrainedTokenizer) -> tuple:
    """
    Tokenizes source + target and creates masked labels.

    Args:
        s: Formatted source/prompt string.
        t: Target/response string.
        tokenizer: HuggingFace tokenizer.

    Returns:
        Tuple of (input_id: Tensor, label: Tensor), where label has
        source tokens masked with IGNORE_INDEX = -100.
    """
```
Import
```python
from dataset.sft_dataset import SFTDataset
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s | str | Yes | Formatted prompt string (from template formatting) |
| t | str | Yes | Response/target string |
| tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer |
Outputs
| Name | Type | Description |
|---|---|---|
| input_id | Tensor | Full tokenized sequence [prompt + response + EOS] |
| label | Tensor | Copy of input_id with prompt tokens set to -100 |
Usage Examples
```python
from dataset.sft_dataset import SFTDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Bypass __init__ so the method can be called without loading a dataset file
dataset = SFTDataset.__new__(SFTDataset)
dataset.IGNORE_INDEX = -100

prompt = "### Instruction:\nTell me a joke.\n### Output:\n"
response = "Why did the chicken cross the road?"

input_id, label = dataset.encode_src_tgt(prompt, response, tokenizer)
# label[:len(prompt_tokens)] == -100 (masked prompt positions)
# label[len(prompt_tokens):] == response token IDs
```
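The choice of -100 as `IGNORE_INDEX` is not arbitrary: it matches the default `ignore_index` of PyTorch's cross-entropy loss, so masked prompt positions drop out of the loss with no extra code. A small self-contained check (random logits, no tokenizer needed):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)                # 4 positions, vocab size 10
labels = torch.tensor([-100, -100, 3, 7])  # first two positions masked

# cross_entropy defaults to ignore_index=-100 and reduction="mean",
# so the mean is taken over non-ignored positions only.
loss_all = F.cross_entropy(logits, labels)
loss_resp = F.cross_entropy(logits[2:], labels[2:])  # response tokens only
# loss_all equals loss_resp: masked positions contribute nothing
```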