Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Hiyouga LLaMA Factory Supervised Processor

From Leeroopedia


Knowledge Sources
Domains Data Processing, Supervised Fine-Tuning
Last Updated 2026-02-06 19:00 GMT

Overview

Dataset processors for supervised fine-tuning (SFT) with support for multi-turn conversations, prompt masking, history masking, and efficient sequence packing.

Description

This module provides two processor classes. SupervisedDatasetProcessor encodes multi-turn conversations by iterating through prompt-response pairs, applying infer_seqlen truncation per turn, and masking prompt tokens with IGNORE_INDEX so the loss is computed only on response tokens. It supports train_on_prompt to include prompt tokens in training, mask_history to train only on the last conversation turn, and efficient_eos for appending EOS tokens. PackedSupervisedDatasetProcessor extends the base by using a greedy knapsack algorithm to bin-pack multiple encoded examples into fixed-length sequences, assigning per-example attention mask indices for neat packing (preventing cross-attention between packed examples) and padding to a consistent cutoff length.

Usage

Use SupervisedDatasetProcessor for standard SFT training workflows. Use PackedSupervisedDatasetProcessor when sequence packing is enabled (via data_args.packing) to maximize GPU utilization by combining shorter examples into single training sequences. These processors are selected automatically by the data loading pipeline based on the training configuration.

Code Reference

Source Location

Signature

@dataclass
class SupervisedDatasetProcessor(DatasetProcessor):
    def _encode_data_example(
        self,
        prompt: list[dict[str, str]],
        response: list[dict[str, str]],
        system: Optional[str],
        tools: Optional[str],
        images: list["ImageInput"],
        videos: list["VideoInput"],
        audios: list["AudioInput"],
    ) -> tuple[list[int], list[int]]

    def preprocess_dataset(self, examples: dict[str, list[Any]]) -> dict[str, list[Any]]
    def print_data_example(self, example: dict[str, list[int]]) -> None

@dataclass
class PackedSupervisedDatasetProcessor(SupervisedDatasetProcessor):
    def preprocess_dataset(self, examples: dict[str, list[Any]]) -> dict[str, list[Any]]

Import

from llamafactory.data.processor.supervised import SupervisedDatasetProcessor, PackedSupervisedDatasetProcessor

I/O Contract

Inputs

Name Type Required Description
examples dict[str, list[Any]] Yes Batch of raw examples with keys _prompt, _response, _system, _tools, _images, _videos, _audios
_prompt[i] list[dict[str, str]] Yes Conversation prompt messages (must have odd length for multi-turn)
_response[i] list[dict[str, str]] Yes Response messages (must have exactly 1 entry for supervised training)

Outputs (SupervisedDatasetProcessor)

Name Type Description
input_ids list[list[int]] Tokenized input sequences containing both prompt and response tokens
attention_mask list[list[int]] Attention masks (all ones)
labels list[list[int]] Labels with prompt tokens masked as IGNORE_INDEX, response tokens preserved

Outputs (PackedSupervisedDatasetProcessor)

Name Type Description
input_ids list[list[int]] Packed tokenized sequences combining multiple examples, padded to cutoff_len
attention_mask list[list[int]] Per-example attention mask indices (incremental integers for neat packing, ones otherwise)
position_ids list[list[int]] Position IDs reset per packed example within the sequence
labels list[list[int]] Packed labels with prompt and padding tokens masked

Usage Examples

from llamafactory.data.processor.supervised import SupervisedDatasetProcessor

# Standard SFT processing
processor = SupervisedDatasetProcessor(
    template=template,
    tokenizer=tokenizer,
    processor=None,
    data_args=data_args,
)
model_inputs = processor.preprocess_dataset(examples)
# model_inputs contains: input_ids, attention_mask, labels
from llamafactory.data.processor.supervised import PackedSupervisedDatasetProcessor

# Packed SFT processing for efficient GPU utilization
packed_processor = PackedSupervisedDatasetProcessor(
    template=template,
    tokenizer=tokenizer,
    processor=None,
    data_args=data_args,  # data_args.packing = True, data_args.neat_packing = True
)
packed_inputs = packed_processor.preprocess_dataset(examples)
# packed_inputs contains: input_ids, attention_mask, position_ids, labels

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment