Implementation:Hiyouga LLaMA Factory V1 Data Loader

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Machine Learning, Data Processing, Plugin Architecture
Last Updated	2026-02-06 19:00 GMT

Overview

DataLoaderPlugin provides local dataset loading from files in various formats (arrow, csv, json, parquet, text) along with data index resampling and selection utilities.

Description

The data loader module defines DataLoaderPlugin (a subclass of BasePlugin) with a registered "local" loader, plus two standalone utility functions. load_data_from_file detects the file type from the extension and calls Hugging Face's load_dataset with the matching builder name; it accepts either a single file or a directory, and can convert the resulting map-style dataset to an iterable dataset for streaming. adjust_data_index resamples a data index to an absolute size or by a proportional weight using random.choices, which samples with replacement; this is what controls dataset mixing ratios in multi-dataset training. select_data_sample selects entries from a data index by slice or by a list of integer indices.
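The extension-based detection described above can be illustrated with a small standalone sketch. The helper name guess_builder_name and the mapping table are hypothetical; they are inferred from the formats and extensions listed on this page and are not copied from the source file.

```python
import os

# Hypothetical mapping from file extension to Hugging Face builder name,
# inferred from the formats and extensions listed on this page.
_EXT_TO_BUILDER = {
    "arrow": "arrow",
    "csv": "csv",
    "json": "json",
    "jsonl": "json",
    "parquet": "parquet",
    "txt": "text",
}


def guess_builder_name(path: str) -> str:
    """Return a builder name for a path, based only on its extension."""
    ext = os.path.splitext(path)[-1].lstrip(".").lower()
    if ext not in _EXT_TO_BUILDER:
        raise ValueError(f"Unknown dataset file type: {ext!r}")
    return _EXT_TO_BUILDER[ext]


print(guess_builder_name("data/train.jsonl"))  # json
print(guess_builder_name("corpus.txt"))        # text
```

The sketch rejects unknown extensions with a ValueError; how the real loader reports unsupported files is not stated on this page.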

Usage

DataLoaderPlugin is invoked by DataEngine when a dataset's configuration specifies source: "local". The adjust_data_index and select_data_sample functions are called by DataEngine for index manipulation during data index building and sample retrieval respectively. To add a new data source, register a new loader with @DataLoaderPlugin("source_name").register().
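To make the registration pattern concrete, here is a minimal toy registry that supports the same decorator shape. ToyPlugin, its internals, and the "memory" source are hypothetical stand-ins for illustration; the real BasePlugin may be implemented differently.

```python
from typing import Any, Callable


class ToyPlugin:
    """Hypothetical stand-in for BasePlugin: a name-keyed function registry."""

    _registry: dict[str, Callable[..., Any]] = {}

    def __init__(self, name: str) -> None:
        self.name = name

    def register(self) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        # Returns a decorator that stores the function under this plugin's name.
        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
            self._registry[self.name] = func
            return func

        return decorator

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Dispatch to whichever function was registered under this name.
        if self.name not in self._registry:
            raise KeyError(f"No loader registered for source {self.name!r}")
        return self._registry[self.name](*args, **kwargs)


@ToyPlugin("memory").register()
def load_from_memory(rows: list[dict]) -> list[dict]:
    """Toy loader for an imaginary "memory" source."""
    return list(rows)


print(ToyPlugin("memory")([{"text": "hello"}]))  # [{'text': 'hello'}]
```

Keying the registry by source name is what lets a caller select a loader from a configuration string without importing the loader function directly.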

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/v1/plugins/data_plugins/loader.py
Lines: 1-108

Signature

class DataLoaderPlugin(BasePlugin):
    def load(self, dataset_info: DatasetInfo) -> HFDataset: ...

@DataLoaderPlugin("local").register()
def load_data_from_file(filepath: str, split: str, streaming: bool) -> HFDataset: ...

def adjust_data_index(
    data_index: list[tuple[str, int]],
    size: int | None,
    weight: float | None,
) -> list[tuple[str, int]]: ...

def select_data_sample(
    data_index: list[tuple[str, int]],
    index: slice | list[int] | Any,
) -> tuple[str, int] | list[tuple[str, int]]: ...

Import

from llamafactory.v1.plugins.data_plugins.loader import (
    DataLoaderPlugin,
    adjust_data_index,
    load_data_from_file,
    select_data_sample,
)

I/O Contract

Inputs (DataLoaderPlugin.load)

Name	Type	Required	Description
dataset_info	DatasetInfo	Yes	Dictionary containing "path" (file or directory path), optional "split" (default: "train"), and optional "streaming" (default: False).
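The defaults in the table above can be sketched as a small argument-resolution step. ToyDatasetInfo and resolve_load_args are hypothetical names used only for this illustration; the real DatasetInfo type may carry more keys.

```python
from typing import TypedDict


class ToyDatasetInfo(TypedDict, total=False):
    """Hypothetical subset of DatasetInfo covering only the keys listed above."""

    path: str
    split: str
    streaming: bool


def resolve_load_args(info: ToyDatasetInfo) -> tuple[str, str, bool]:
    """Apply the documented defaults: split="train", streaming=False."""
    return info["path"], info.get("split", "train"), info.get("streaming", False)


print(resolve_load_args({"path": "data/train.json"}))
# ('data/train.json', 'train', False)
```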

Inputs (load_data_from_file)

Name	Type	Required	Description
filepath	str	Yes	Path to a local file or directory containing dataset files. Supported extensions: arrow, csv, json, jsonl, parquet, txt.
split	str	Yes	Dataset split to load (e.g., "train").
streaming	bool	Yes	Whether to convert the dataset to an iterable (streaming) dataset.

Inputs (adjust_data_index)

Name	Type	Required	Description
data_index	list[tuple[str, int]]	Yes	List of (dataset_name, sample_index) tuples.
size	int or None	No	Desired absolute dataset size, or None to skip. Resamples with replacement via random.choices.
weight	float or None	No	Desired proportional weight multiplier, or None to skip. Resamples to len(data_index) * weight entries.
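A rough sketch of size-based and weight-based resampling with random.choices follows. The function name resample_index, the seed parameter, the truncation to int, and the order in which size and weight are applied are all assumptions made for this illustration, not documented behavior of adjust_data_index.

```python
import random
from typing import Optional

DataIndex = list[tuple[str, int]]


def resample_index(
    data_index: DataIndex,
    size: Optional[int] = None,
    weight: Optional[float] = None,
    seed: Optional[int] = None,  # assumption: added here only for reproducibility
) -> DataIndex:
    """Resample (dataset_name, sample_index) pairs with replacement."""
    rng = random.Random(seed)
    if size is not None:
        data_index = rng.choices(data_index, k=size)
    if weight is not None:
        # Assumption: fractional targets are truncated toward zero.
        data_index = rng.choices(data_index, k=int(len(data_index) * weight))
    return data_index


index = [("dataset_a", i) for i in range(10)]
print(len(resample_index(index, size=4)))      # 4
print(len(resample_index(index, weight=2.5)))  # 25
```

Because random.choices samples with replacement, a resampled index can contain duplicates even when the target is smaller than the original, and some original entries may not appear at all.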

Inputs (select_data_sample)

Name	Type	Required	Description
data_index	list[tuple[str, int]]	Yes	List of (dataset_name, sample_index) tuples.
index	slice or list[int]	Yes	A slice object or list of integer indices for sample selection.
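Slice-based and list-based selection can be sketched as below. The name pick_entries is hypothetical, and raising ValueError for other index types is an assumption; the signature on this page also admits other index types and a single-tuple return, which this sketch does not cover.

```python
from typing import Any

DataIndex = list[tuple[str, int]]


def pick_entries(data_index: DataIndex, index: Any) -> DataIndex:
    """Select entries from a data index by slice or by list of positions."""
    if isinstance(index, slice):
        # slice.indices clamps start/stop/step to the length of the index.
        return [data_index[i] for i in range(*index.indices(len(data_index)))]
    if isinstance(index, list):
        return [data_index[i] for i in index]
    # Assumption: unsupported index types are rejected.
    raise ValueError(f"Invalid index type: {type(index).__name__}")


index = [("dataset_a", i) for i in range(10)]
print(pick_entries(index, slice(0, 3)))
# [('dataset_a', 0), ('dataset_a', 1), ('dataset_a', 2)]
print(pick_entries(index, [0, 5, 9]))
# [('dataset_a', 0), ('dataset_a', 5), ('dataset_a', 9)]
```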

Outputs

Name	Type	Description
load_data_from_file return	HFDataset	A Hugging Face Dataset (or IterableDataset if streaming=True).
adjust_data_index return	list[tuple[str, int]]	Resampled data index with the desired size or weight.
select_data_sample return	tuple[str, int] or list[tuple[str, int]]	Selected data index entries for the given index.

Usage Examples

from llamafactory.v1.plugins.data_plugins.loader import (
    DataLoaderPlugin, adjust_data_index, select_data_sample
)

# Load a local dataset via plugin
loader = DataLoaderPlugin("local")
dataset = loader.load({"path": "data/train.json", "split": "train", "source": "local"})

# Resample a data index to a specific size
data_index = [("dataset_a", i) for i in range(1000)]
resampled = adjust_data_index(data_index, size=500, weight=None)
print(len(resampled))  # 500

# Resample by weight (double the dataset)
resampled = adjust_data_index(data_index, size=None, weight=2.0)
print(len(resampled))  # 2000

# Select samples by slice
selected = select_data_sample(data_index, slice(0, 10))
print(len(selected))  # 10

# Select samples by index list
selected = select_data_sample(data_index, [0, 5, 10])
print(len(selected))  # 3

Related Pages

Hiyouga_LLaMA_Factory_V1_Data_Engine - The primary consumer that uses DataLoaderPlugin for dataset loading and index utilities.
Hiyouga_LLaMA_Factory_V1_Data_Converter - Sibling plugin for format conversion of loaded data.
Hiyouga_LLaMA_Factory_V1_Batching - Downstream consumer of the loaded and indexed data.
