Implementation:Hiyouga LLaMA Factory V1 Data Engine

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Machine Learning, Data Processing
Last Updated	2026-02-06 19:00 GMT

Overview

DataEngine is the central dataset abstraction that loads, indexes, and converts training data from multiple sources and formats into a unified PyTorch Dataset.

Description

The DataEngine class extends PyTorch's Dataset to provide a model-agnostic data loading pipeline. It parses dataset configuration from YAML files (local or from HuggingFace Hub), loads datasets via HuggingFace load_dataset or local file plugins, builds a data index with optional size/weight-based resampling for controlling dataset mixing ratios, and converts raw samples to the standardized internal format using pluggable converters (alpaca, sharegpt, pair). The engine supports both map-style and streaming datasets, though streaming mode has limited index access.

Usage

Use DataEngine to load and prepare training data for the v1 training pipeline. Instantiate it with a dataset path (YAML config, local directory, or HuggingFace Hub identifier) and access samples by index. It serves as the train_dataset parameter for BaseTrainer and its subclasses.

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/v1/core/data_engine.py
Lines: 1-196

Signature

class DataEngine(Dataset):
    def __init__(self, dataset_path: str) -> None: ...

    def _get_dataset_info(self) -> None: ...
    def _load_dataset(self) -> None: ...
    def _build_data_index(self) -> None: ...
    def _convert_data_sample(self, raw_sample: dict[str, Any], dataset_name: str) -> Sample: ...

    def __len__(self) -> int: ...
    def __getitem__(self, index: int | Any) -> Sample | list[Sample]: ...
    def __iter__(self) -> Iterable[Sample]: ...

Import

from llamafactory.v1.core.data_engine import DataEngine

I/O Contract

Inputs

Name	Type	Required	Description
dataset_path	str	Yes	Path to dataset configuration. Accepts YAML config files (local or HuggingFace Hub URIs), local file/directory paths, or HuggingFace dataset identifiers.
index (__getitem__)	int, slice, or list[int]	Yes	Index for sample access. Integer for single sample, slice or list for multiple samples.

Outputs

Name	Type	Description
__getitem__ return	Sample or list[Sample]	Converted dataset sample(s) in the standardized format with _dataset_name metadata.
__len__ return	int	Number of samples in the data index (-1 for streaming datasets).

Key Attributes

Name	Type	Description
path	str	The dataset path provided at initialization.
datasets	dict[str, HFDataset]	Dictionary mapping dataset names to loaded HuggingFace datasets.
dataset_infos	dict[str, DatasetInfo]	Dictionary mapping dataset names to their configuration metadata.
data_index	list[tuple[str, int]]	List of (dataset_name, sample_index) tuples for unified indexing across multiple datasets.
streaming	bool	Whether the dataset operates in streaming mode.

Usage Examples

from llamafactory.v1.core.data_engine import DataEngine

# Load from a YAML config
data_engine = DataEngine("data/v1_sft_demo.yaml")

# Access a single sample
sample = data_engine[0]

# Access multiple samples via slice
samples = data_engine[0:10]

# Get dataset length
print(len(data_engine))

# Use as training dataset
trainer = SFTTrainer(args=training_args, model=model, renderer=renderer, train_dataset=data_engine)

Related Pages

Hiyouga_LLaMA_Factory_V1_Data_Converter - Pluggable converters used by DataEngine to transform raw samples.
Hiyouga_LLaMA_Factory_V1_Data_Loader - Local file loading plugin used by DataEngine.
Hiyouga_LLaMA_Factory_V1_Batching - BatchGenerator that wraps DataEngine for training.
Hiyouga_LLaMA_Factory_V1_Base_Trainer - Consumes DataEngine as the training dataset.

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment