Implementation:Hiyouga LLaMA Factory Supervised Processor
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Supervised Fine-Tuning |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Dataset processors for supervised fine-tuning (SFT) with support for multi-turn conversations, prompt masking, history masking, and efficient sequence packing.
Description
This module provides two processor classes. SupervisedDatasetProcessor encodes multi-turn conversations by iterating through prompt-response pairs, applying infer_seqlen truncation per turn, and masking prompt tokens with IGNORE_INDEX so the loss is computed only on response tokens. It supports train_on_prompt to include prompt tokens in training, mask_history to train only on the last conversation turn, and efficient_eos for appending EOS tokens. PackedSupervisedDatasetProcessor extends the base by using a greedy knapsack algorithm to bin-pack multiple encoded examples into fixed-length sequences, assigning per-example attention mask indices for neat packing (preventing cross-attention between packed examples) and padding to a consistent cutoff length.
Usage
Use SupervisedDatasetProcessor for standard SFT training workflows. Use PackedSupervisedDatasetProcessor when sequence packing is enabled (via data_args.packing) to maximize GPU utilization by combining shorter examples into single training sequences. These processors are selected automatically by the data loading pipeline based on the training configuration.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/data/processor/supervised.py
- Lines: 1-203
Signature
@dataclass
class SupervisedDatasetProcessor(DatasetProcessor):
def _encode_data_example(
self,
prompt: list[dict[str, str]],
response: list[dict[str, str]],
system: Optional[str],
tools: Optional[str],
images: list["ImageInput"],
videos: list["VideoInput"],
audios: list["AudioInput"],
) -> tuple[list[int], list[int]]
def preprocess_dataset(self, examples: dict[str, list[Any]]) -> dict[str, list[Any]]
def print_data_example(self, example: dict[str, list[int]]) -> None
@dataclass
class PackedSupervisedDatasetProcessor(SupervisedDatasetProcessor):
def preprocess_dataset(self, examples: dict[str, list[Any]]) -> dict[str, list[Any]]
Import
from llamafactory.data.processor.supervised import SupervisedDatasetProcessor, PackedSupervisedDatasetProcessor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| examples | dict[str, list[Any]] |
Yes | Batch of raw examples with keys _prompt, _response, _system, _tools, _images, _videos, _audios |
| _prompt[i] | list[dict[str, str]] |
Yes | Conversation prompt messages (must have odd length for multi-turn) |
| _response[i] | list[dict[str, str]] |
Yes | Response messages (must have exactly 1 entry for supervised training) |
Outputs (SupervisedDatasetProcessor)
| Name | Type | Description |
|---|---|---|
| input_ids | list[list[int]] |
Tokenized input sequences containing both prompt and response tokens |
| attention_mask | list[list[int]] |
Attention masks (all ones) |
| labels | list[list[int]] |
Labels with prompt tokens masked as IGNORE_INDEX, response tokens preserved |
Outputs (PackedSupervisedDatasetProcessor)
| Name | Type | Description |
|---|---|---|
| input_ids | list[list[int]] |
Packed tokenized sequences combining multiple examples, padded to cutoff_len |
| attention_mask | list[list[int]] |
Per-example attention mask indices (incremental integers for neat packing, ones otherwise) |
| position_ids | list[list[int]] |
Position IDs reset per packed example within the sequence |
| labels | list[list[int]] |
Packed labels with prompt and padding tokens masked |
Usage Examples
from llamafactory.data.processor.supervised import SupervisedDatasetProcessor
# Standard SFT processing
processor = SupervisedDatasetProcessor(
template=template,
tokenizer=tokenizer,
processor=None,
data_args=data_args,
)
model_inputs = processor.preprocess_dataset(examples)
# model_inputs contains: input_ids, attention_mask, labels
from llamafactory.data.processor.supervised import PackedSupervisedDatasetProcessor
# Packed SFT processing for efficient GPU utilization
packed_processor = PackedSupervisedDatasetProcessor(
template=template,
tokenizer=tokenizer,
processor=None,
data_args=data_args, # data_args.packing = True, data_args.neat_packing = True
)
packed_inputs = packed_processor.preprocess_dataset(examples)
# packed_inputs contains: input_ids, attention_mask, position_ids, labels
Related Pages
- Hiyouga_LLaMA_Factory_Processor_Utils - Provides the DatasetProcessor base class, infer_seqlen, and greedy_knapsack used by this module
- Hiyouga_LLaMA_Factory_Pairwise_Processor - Processor for DPO-style pairwise preference training
- Hiyouga_LLaMA_Factory_Feedback_Processor - Processor for KTO-style feedback training
- Hiyouga_LLaMA_Factory_Data_Args - DataArguments controlling packing, cutoff_len, train_on_prompt, mask_history, and neat_packing