Implementation:Hiyouga LLaMA Factory Data Loader

From Leeroopedia


Knowledge Sources
Domains Data Processing, Training Pipeline
Last Updated 2026-02-06 19:00 GMT

Overview

Concrete end-to-end dataset loading, preprocessing, and tokenization orchestrator provided by LLaMA Factory.

Description

This module is the central data pipeline orchestrator for LLaMA Factory. The primary entry point get_dataset coordinates the entire flow from raw data sources to tokenized, training-ready datasets. The pipeline consists of:

  1. Loading -- _load_single_dataset loads datasets from HuggingFace Hub, ModelScope, OpenMind, local files, cloud storage, or scripts, with support for streaming, sampling, and truncation
  2. Format Alignment -- Each loaded dataset is passed through align_dataset, which normalizes it to the standard schema
  3. Merging -- _get_merged_dataset loads and merges multiple datasets with interleaving support
  4. Processor Selection -- _get_dataset_processor selects the appropriate tokenization processor based on training stage (pt, sft, rm, ppo, kto) and packing mode
  5. Preprocessing -- _get_preprocessed_dataset applies the selected processor via dataset.map()
  6. Splitting -- Train/eval splits are created and optionally saved as pre-tokenized datasets
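
The six steps above can be sketched with plain Python lists standing in for datasets. This is only an illustration of the control flow: every function below is a hypothetical stand-in, not LLaMA Factory's implementation, and the "text" schema, the whitespace tokenizer, and the tail-based eval split are assumptions made to keep the example self-contained.

```python
from typing import Callable, Optional


def load_single(raw: list, max_samples: Optional[int] = None) -> list:
    """Step 1 (stand-in): 'load' one source and optionally truncate it."""
    return list(raw) if max_samples is None else list(raw[:max_samples])


def align(rows: list, text_key: str) -> list:
    """Step 2 (stand-in): map a source-specific column onto one shared schema."""
    return [{"text": row[text_key]} for row in rows]


def merge(datasets: list) -> list:
    """Step 3 (stand-in): concatenate the aligned datasets."""
    return [row for dataset in datasets for row in dataset]


def select_processor(stage: str) -> Callable[[dict], dict]:
    """Step 4 (stand-in): choose a per-example function from the stage."""
    def toy_tokenize(row: dict) -> dict:
        # Toy tokenizer: split on whitespace and use word lengths as fake token IDs.
        return {"input_ids": [len(word) for word in row["text"].split()]}

    processors = {"pt": toy_tokenize}  # only one stage is modelled here
    if stage not in processors:
        raise ValueError(f"stage {stage!r} is not modelled in this sketch")
    return processors[stage]


def preprocess(rows: list, processor: Callable[[dict], dict]) -> list:
    """Step 5 (stand-in): apply the processor to every row, like a map call."""
    return [processor(row) for row in rows]


def split(rows: list, eval_size: int) -> dict:
    """Step 6 (stand-in): hold out the last eval_size rows for evaluation."""
    if eval_size <= 0:
        return {"train_dataset": rows}
    return {"train_dataset": rows[:-eval_size], "eval_dataset": rows[-eval_size:]}


def run_pipeline(sources: list, stage: str, eval_size: int) -> dict:
    """Chain the six steps; sources is a list of (rows, text_key) pairs."""
    aligned = [align(load_single(rows, max_samples=2), key) for rows, key in sources]
    tokenized = preprocess(merge(aligned), select_processor(stage))
    return split(tokenized, eval_size)
```

The point of the sketch is the ordering: alignment happens per source before merging, so the processor only ever sees one schema, and the split is taken after tokenization.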

The module supports six dataset processor types: PretrainDatasetProcessor, SupervisedDatasetProcessor, PackedSupervisedDatasetProcessor, PairwiseDatasetProcessor, FeedbackDatasetProcessor, and UnsupervisedDatasetProcessor.
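
One plausible way the six processors map onto the training stage and packing mode is sketched below. The mapping is an assumption inferred from the class names and from the stage, packing, and do_generate inputs described on this page; pick_processor_name is a hypothetical helper that returns class names as strings, and the real _get_dataset_processor may dispatch differently.

```python
def pick_processor_name(stage: str, packing: bool = False, do_generate: bool = False) -> str:
    """Return the name of the processor a given configuration would likely use.

    Assumed mapping, for illustration only.
    """
    if stage == "pt":
        return "PretrainDatasetProcessor"
    if stage == "sft" and not do_generate:
        # Packing concatenates several examples into one sequence.
        return "PackedSupervisedDatasetProcessor" if packing else "SupervisedDatasetProcessor"
    if stage == "rm":
        return "PairwiseDatasetProcessor"
    if stage == "kto":
        return "FeedbackDatasetProcessor"
    # Assumed fallback: prompt-only data, e.g. PPO or generation with an SFT model.
    return "UnsupervisedDatasetProcessor"


for stage in ("pt", "sft", "rm", "ppo", "kto"):
    print(stage, "->", pick_processor_name(stage))
```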

Usage

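get_dataset is called by every training workflow (SFT, DPO, KTO, PPO, PT, RM) at the start of training to prepare the data. The stage argument accepts only five values, so DPO has no stage of its own: its workflow passes "rm", because DPO and reward modeling both consume pairwise preference data. The function returns a DatasetModule dictionary containing the training dataset and, when one is configured, the evaluation dataset.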

Code Reference

Source Location

Signature

def get_dataset(
    template: "Template",
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    stage: Literal["pt", "sft", "rm", "ppo", "kto"],
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"] = None,
) -> "DatasetModule": ...

def _load_single_dataset(
    dataset_attr: "DatasetAttr",
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
) -> Union["Dataset", "IterableDataset"]: ...

def _get_merged_dataset(
    dataset_names: list[str] | None,
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    stage: Literal["pt", "sft", "rm", "ppo", "kto"],
    return_dict: bool = False,
) -> Union["Dataset", "IterableDataset", dict[str, "Dataset"]] | None: ...

def _get_dataset_processor(
    data_args: "DataArguments",
    stage: Literal["pt", "sft", "rm", "ppo", "kto"],
    template: "Template",
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"],
    do_generate: bool = False,
) -> "DatasetProcessor": ...

Import

from llamafactory.data.loader import get_dataset

I/O Contract

Inputs

Name | Type | Required | Description
template | Template | Yes | Chat template for tokenization
model_args | ModelArguments | Yes | Model path and hub tokens for dataset downloading
data_args | DataArguments | Yes | Dataset names, paths, streaming, preprocessing config
training_args | Seq2SeqTrainingArguments | Yes | Training configuration including num_proc, seed, logging
stage | str | Yes | Training stage: "pt", "sft", "rm", "ppo", or "kto"
tokenizer | PreTrainedTokenizer | Yes | Tokenizer for encoding text to token IDs
processor | ProcessorMixin | No | Multimodal processor for VLM models

Outputs

Name | Type | Description
DatasetModule | dict | Dictionary with "train_dataset" and optional "eval_dataset" keys containing tokenized HuggingFace datasets
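
Because "eval_dataset" is optional, callers need to handle its absence. The sketch below models the return value as a TypedDict with optional keys and reports split sizes defensively. DatasetModuleSketch and summarize are hypothetical names introduced for illustration; the actual DatasetModule type may be declared differently, and plain lists stand in for tokenized datasets.

```python
from typing import Any, Optional, TypedDict


class DatasetModuleSketch(TypedDict, total=False):
    """Assumed shape of the returned dictionary; both keys may be missing."""
    train_dataset: Any
    eval_dataset: Any


def summarize(module: DatasetModuleSketch) -> dict[str, Optional[int]]:
    """Report the size of each split, using None for a split that is absent."""
    sizes: dict[str, Optional[int]] = {}
    for key in ("train_dataset", "eval_dataset"):
        split = module.get(key)  # .get() avoids a KeyError on a missing split
        sizes[key] = len(split) if split is not None else None
    return sizes


# Plain lists stand in for tokenized datasets here.
print(summarize({"train_dataset": [[1, 2], [3]]}))
```

Note that len() is only meaningful for map-style datasets; a streaming dataset has no length, so a real caller would need a different check in that case.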

Usage Examples

from llamafactory.data.loader import get_dataset

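# Called in training workflows. The template, model_args, data_args, training_args,
# tokenizer, and processor objects are assumed to have been built earlier by the workflow.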
dataset_module = get_dataset(
    template=template,
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    stage="sft",
    tokenizer=tokenizer,
    processor=processor,
)

train_dataset = dataset_module["train_dataset"]
# "eval_dataset" is optional, so use .get() to receive None instead of a KeyError.
eval_dataset = dataset_module.get("eval_dataset")
