Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Hiyouga LLaMA Factory V1 Data Engine

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Data Processing
Last Updated 2026-02-06 19:00 GMT

Overview

DataEngine is the central dataset abstraction that loads, indexes, and converts training data from multiple sources and formats into a unified PyTorch Dataset.

Description

The DataEngine class extends PyTorch's Dataset to provide a model-agnostic data loading pipeline. It parses dataset configuration from YAML files (local or from HuggingFace Hub), loads datasets via HuggingFace load_dataset or local file plugins, builds a data index with optional size/weight-based resampling for controlling dataset mixing ratios, and converts raw samples to the standardized internal format using pluggable converters (alpaca, sharegpt, pair). The engine supports both map-style and streaming datasets, though streaming mode has limited index access.

Usage

Use DataEngine to load and prepare training data for the v1 training pipeline. Instantiate it with a dataset path (YAML config, local directory, or HuggingFace Hub identifier) and access samples by index. It serves as the train_dataset parameter for BaseTrainer and its subclasses.

Code Reference

Source Location

Signature

class DataEngine(Dataset):
    def __init__(self, dataset_path: str) -> None: ...

    def _get_dataset_info(self) -> None: ...
    def _load_dataset(self) -> None: ...
    def _build_data_index(self) -> None: ...
    def _convert_data_sample(self, raw_sample: dict[str, Any], dataset_name: str) -> Sample: ...

    def __len__(self) -> int: ...
    def __getitem__(self, index: int | Any) -> Sample | list[Sample]: ...
    def __iter__(self) -> Iterable[Sample]: ...

Import

from llamafactory.v1.core.data_engine import DataEngine

I/O Contract

Inputs

Name Type Required Description
dataset_path str Yes Path to dataset configuration. Accepts YAML config files (local or HuggingFace Hub URIs), local file/directory paths, or HuggingFace dataset identifiers.
index (__getitem__) int, slice, or list[int] Yes Index for sample access. Integer for single sample, slice or list for multiple samples.

Outputs

Name Type Description
__getitem__ return Sample or list[Sample] Converted dataset sample(s) in the standardized format with _dataset_name metadata.
__len__ return int Number of samples in the data index (-1 for streaming datasets).

Key Attributes

Name Type Description
path str The dataset path provided at initialization.
datasets dict[str, HFDataset] Dictionary mapping dataset names to loaded HuggingFace datasets.
dataset_infos dict[str, DatasetInfo] Dictionary mapping dataset names to their configuration metadata.
data_index list[tuple[str, int]] List of (dataset_name, sample_index) tuples for unified indexing across multiple datasets.
streaming bool Whether the dataset operates in streaming mode.

Usage Examples

from llamafactory.v1.core.data_engine import DataEngine

# Load from a YAML config
data_engine = DataEngine("data/v1_sft_demo.yaml")

# Access a single sample
sample = data_engine[0]

# Access multiple samples via slice
samples = data_engine[0:10]

# Get dataset length
print(len(data_engine))

# Use as training dataset
trainer = SFTTrainer(args=training_args, model=model, renderer=renderer, train_dataset=data_engine)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment