Implementation:Hiyouga LLaMA Factory V1 Data Engine
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Data Processing |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
DataEngine is the central dataset abstraction that loads, indexes, and converts training data from multiple sources and formats into a unified PyTorch Dataset.
Description
The DataEngine class extends PyTorch's Dataset to provide a model-agnostic data loading pipeline. It parses dataset configuration from YAML files (local or from HuggingFace Hub), loads datasets via HuggingFace load_dataset or local file plugins, builds a data index with optional size/weight-based resampling for controlling dataset mixing ratios, and converts raw samples to the standardized internal format using pluggable converters (alpaca, sharegpt, pair). The engine supports both map-style and streaming datasets, though streaming mode has limited index access.
Usage
Use DataEngine to load and prepare training data for the v1 training pipeline. Instantiate it with a dataset path (YAML config, local directory, or HuggingFace Hub identifier) and access samples by index. It serves as the train_dataset parameter for BaseTrainer and its subclasses.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/v1/core/data_engine.py
- Lines: 1-196
Signature
class DataEngine(Dataset):
def __init__(self, dataset_path: str) -> None: ...
def _get_dataset_info(self) -> None: ...
def _load_dataset(self) -> None: ...
def _build_data_index(self) -> None: ...
def _convert_data_sample(self, raw_sample: dict[str, Any], dataset_name: str) -> Sample: ...
def __len__(self) -> int: ...
def __getitem__(self, index: int | Any) -> Sample | list[Sample]: ...
def __iter__(self) -> Iterable[Sample]: ...
Import
from llamafactory.v1.core.data_engine import DataEngine
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_path | str | Yes | Path to dataset configuration. Accepts YAML config files (local or HuggingFace Hub URIs), local file/directory paths, or HuggingFace dataset identifiers. |
| index (__getitem__) | int, slice, or list[int] | Yes | Index for sample access. Integer for single sample, slice or list for multiple samples. |
Outputs
| Name | Type | Description |
|---|---|---|
| __getitem__ return | Sample or list[Sample] | Converted dataset sample(s) in the standardized format with _dataset_name metadata. |
| __len__ return | int | Number of samples in the data index (-1 for streaming datasets). |
Key Attributes
| Name | Type | Description |
|---|---|---|
| path | str | The dataset path provided at initialization. |
| datasets | dict[str, HFDataset] | Dictionary mapping dataset names to loaded HuggingFace datasets. |
| dataset_infos | dict[str, DatasetInfo] | Dictionary mapping dataset names to their configuration metadata. |
| data_index | list[tuple[str, int]] | List of (dataset_name, sample_index) tuples for unified indexing across multiple datasets. |
| streaming | bool | Whether the dataset operates in streaming mode. |
Usage Examples
from llamafactory.v1.core.data_engine import DataEngine
# Load from a YAML config
data_engine = DataEngine("data/v1_sft_demo.yaml")
# Access a single sample
sample = data_engine[0]
# Access multiple samples via slice
samples = data_engine[0:10]
# Get dataset length
print(len(data_engine))
# Use as training dataset
trainer = SFTTrainer(args=training_args, model=model, renderer=renderer, train_dataset=data_engine)
Related Pages
- Hiyouga_LLaMA_Factory_V1_Data_Converter - Pluggable converters used by DataEngine to transform raw samples.
- Hiyouga_LLaMA_Factory_V1_Data_Loader - Local file loading plugin used by DataEngine.
- Hiyouga_LLaMA_Factory_V1_Batching - BatchGenerator that wraps DataEngine for training.
- Hiyouga_LLaMA_Factory_V1_Base_Trainer - Consumes DataEngine as the training dataset.