Implementation:Hiyouga LLaMA Factory Supervised Processor

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Data Processing, Supervised Fine-Tuning
Last Updated	2026-02-06 19:00 GMT

Overview

Dataset processors for supervised fine-tuning (SFT) with support for multi-turn conversations, prompt masking, history masking, and efficient sequence packing.

Description

This module provides two processor classes. SupervisedDatasetProcessor encodes multi-turn conversations by iterating through prompt-response pairs, applying infer_seqlen truncation per turn, and masking prompt tokens with IGNORE_INDEX so the loss is computed only on response tokens. It supports train_on_prompt to include prompt tokens in training, mask_history to train only on the last conversation turn, and efficient_eos for appending EOS tokens. PackedSupervisedDatasetProcessor extends the base by using a greedy knapsack algorithm to bin-pack multiple encoded examples into fixed-length sequences, assigning per-example attention mask indices for neat packing (preventing cross-attention between packed examples) and padding to a consistent cutoff length.

Usage

Use SupervisedDatasetProcessor for standard SFT training workflows. Use PackedSupervisedDatasetProcessor when sequence packing is enabled (via data_args.packing) to maximize GPU utilization by combining shorter examples into single training sequences. These processors are selected automatically by the data loading pipeline based on the training configuration.

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/data/processor/supervised.py
Lines: 1-203

Signature

@dataclass
class SupervisedDatasetProcessor(DatasetProcessor):
    def _encode_data_example(
        self,
        prompt: list[dict[str, str]],
        response: list[dict[str, str]],
        system: Optional[str],
        tools: Optional[str],
        images: list["ImageInput"],
        videos: list["VideoInput"],
        audios: list["AudioInput"],
    ) -> tuple[list[int], list[int]]

    def preprocess_dataset(self, examples: dict[str, list[Any]]) -> dict[str, list[Any]]
    def print_data_example(self, example: dict[str, list[int]]) -> None

@dataclass
class PackedSupervisedDatasetProcessor(SupervisedDatasetProcessor):
    def preprocess_dataset(self, examples: dict[str, list[Any]]) -> dict[str, list[Any]]

Import

from llamafactory.data.processor.supervised import SupervisedDatasetProcessor, PackedSupervisedDatasetProcessor

I/O Contract

Inputs

Name	Type	Required	Description
examples	`dict[str, list[Any]]`	Yes	Batch of raw examples with keys _prompt, _response, _system, _tools, _images, _videos, _audios
_prompt[i]	`list[dict[str, str]]`	Yes	Conversation prompt messages (must have odd length for multi-turn)
_response[i]	`list[dict[str, str]]`	Yes	Response messages (must have exactly 1 entry for supervised training)

Outputs (SupervisedDatasetProcessor)

Name	Type	Description
input_ids	`list[list[int]]`	Tokenized input sequences containing both prompt and response tokens
attention_mask	`list[list[int]]`	Attention masks (all ones)
labels	`list[list[int]]`	Labels with prompt tokens masked as IGNORE_INDEX, response tokens preserved

Outputs (PackedSupervisedDatasetProcessor)

Name	Type	Description
input_ids	`list[list[int]]`	Packed tokenized sequences combining multiple examples, padded to cutoff_len
attention_mask	`list[list[int]]`	Per-example attention mask indices (incremental integers for neat packing, ones otherwise)
position_ids	`list[list[int]]`	Position IDs reset per packed example within the sequence
labels	`list[list[int]]`	Packed labels with prompt and padding tokens masked

Usage Examples

from llamafactory.data.processor.supervised import SupervisedDatasetProcessor

# Standard SFT processing
processor = SupervisedDatasetProcessor(
    template=template,
    tokenizer=tokenizer,
    processor=None,
    data_args=data_args,
)
model_inputs = processor.preprocess_dataset(examples)
# model_inputs contains: input_ids, attention_mask, labels

from llamafactory.data.processor.supervised import PackedSupervisedDatasetProcessor

# Packed SFT processing for efficient GPU utilization
packed_processor = PackedSupervisedDatasetProcessor(
    template=template,
    tokenizer=tokenizer,
    processor=None,
    data_args=data_args,  # data_args.packing = True, data_args.neat_packing = True
)
packed_inputs = packed_processor.preprocess_dataset(examples)
# packed_inputs contains: input_ids, attention_mask, position_ids, labels

Related Pages

Hiyouga_LLaMA_Factory_Processor_Utils - Provides the DatasetProcessor base class, infer_seqlen, and greedy_knapsack used by this module
Hiyouga_LLaMA_Factory_Pairwise_Processor - Processor for DPO-style pairwise preference training
Hiyouga_LLaMA_Factory_Feedback_Processor - Processor for KTO-style feedback training
Hiyouga_LLaMA_Factory_Data_Args - DataArguments controlling packing, cutoff_len, train_on_prompt, mask_history, and neat_packing

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment