Implementation: SFTDataset.encode_src_tgt
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete utility from the LLMBook repository for tokenizing instruction-response pairs, with loss masking applied to the prompt tokens.
Description
The SFTDataset.encode_src_tgt method tokenizes a source (prompt) and target (response) pair into a single sequence, then creates a label tensor where the prompt portion is masked with IGNORE_INDEX (-100). This ensures cross-entropy loss is only computed on the response tokens.
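The masking step can be sketched in isolation. This is a minimal re-creation of the described behavior, not the repository's code: a toy tokenization is assumed (plain token-ID lists plus an EOS ID), and only the concatenate-then-mask logic is shown.

```python
import torch

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default

def encode_src_tgt_sketch(src_ids: list, tgt_ids: list, eos_id: int):
    """Hypothetical sketch: concatenate prompt and response token IDs,
    append EOS, then mask the prompt portion of the label tensor."""
    input_id = torch.tensor(src_ids + tgt_ids + [eos_id])
    label = input_id.clone()
    label[:len(src_ids)] = IGNORE_INDEX  # loss is never computed on the prompt
    return input_id, label

input_id, label = encode_src_tgt_sketch([1, 2, 3], [4, 5], eos_id=0)
# label -> tensor([-100, -100, -100, 4, 5, 0])
```

The key property is that `input_id` and `label` have identical length, so the label tensor lines up position-for-position with the model's logits.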
Usage
This method is called internally by SFTDataset.process() for each training example after template formatting.
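The surrounding calling pattern might look like the sketch below. The template string and `format_example` helper are assumptions for illustration, not the repository's actual code.

```python
# Hypothetical sketch of the per-example flow inside process():
# template-format the raw record, then hand (src, tgt) to encode_src_tgt.
TEMPLATE = "### Instruction:\n{instruction}\n### Output:\n"  # assumed template

def format_example(example: dict) -> tuple:
    """Split a raw record into (formatted prompt, response) strings."""
    src = TEMPLATE.format(instruction=example["instruction"])
    tgt = example["output"]
    return src, tgt

src, tgt = format_example({"instruction": "Tell me a joke.",
                           "output": "Why did the chicken cross the road?"})
# src, tgt would then be passed to encode_src_tgt along with the tokenizer
```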
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/7.2 SFT数据类.py
- Lines: 37-45
Signature
```python
def encode_src_tgt(self, s: str, t: str, tokenizer: PreTrainedTokenizer) -> tuple:
    """
    Tokenizes source + target and creates masked labels.

    Args:
        s: Formatted source/prompt string.
        t: Target/response string.
        tokenizer: HuggingFace tokenizer.

    Returns:
        Tuple of (input_id: Tensor, label: Tensor), where label has
        source tokens masked with IGNORE_INDEX = -100.
    """
```
Import
```python
from dataset.sft_dataset import SFTDataset
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s | str | Yes | Formatted prompt string (from template formatting) |
| t | str | Yes | Response/target string |
| tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer |
Outputs
| Name | Type | Description |
|---|---|---|
| input_id | Tensor | Full tokenized sequence [prompt + response + EOS] |
| label | Tensor | Copy of input_id with prompt tokens set to -100 |
Usage Examples
```python
from dataset.sft_dataset import SFTDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Bypass __init__ so the method can be called without loading a dataset file
dataset = SFTDataset.__new__(SFTDataset)
dataset.IGNORE_INDEX = -100

prompt = "### Instruction:\nTell me a joke.\n### Output:\n"
response = "Why did the chicken cross the road?"

input_id, label = dataset.encode_src_tgt(prompt, response, tokenizer)
# label[:len(prompt_tokens)] == -100 (masked prompt positions)
# label[len(prompt_tokens):] == response token IDs
```
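The choice of -100 as `IGNORE_INDEX` is not arbitrary: it matches the default `ignore_index` of PyTorch's cross-entropy loss, so masked prompt positions drop out of the loss with no extra code. A small self-contained check (random logits, no tokenizer needed):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)                # 4 positions, vocab size 10
labels = torch.tensor([-100, -100, 3, 7])  # first two positions masked

# cross_entropy defaults to ignore_index=-100 and reduction="mean",
# so the mean is taken over non-ignored positions only.
loss_all = F.cross_entropy(logits, labels)
loss_resp = F.cross_entropy(logits[2:], labels[2:])  # response tokens only
# loss_all equals loss_resp: masked positions contribute nothing
```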