Implementation:Hiyouga LLaMA Factory V1 Data Loader

From Leeroopedia
Revision as of 15:07, 16 February 2026 by Admin (Auto-imported from implementations/Hiyouga_LLaMA_Factory_V1_Data_Loader.md)


Knowledge Sources
Domains: Machine Learning, Data Processing, Plugin Architecture
Last Updated: 2026-02-06 19:00 GMT

Overview

DataLoaderPlugin provides local dataset loading from files in various formats (arrow, csv, json, parquet, text) along with data index resampling and selection utilities.

Description

The data loader module defines DataLoaderPlugin (a subclass of BasePlugin) with a registered "local" loader, plus two standalone utility functions. load_data_from_file detects the file type from the extension and loads the dataset with Hugging Face's load_dataset using the matching builder name; it accepts either a single file or a directory, and can convert a map-style dataset to an iterable dataset for streaming. adjust_data_index resamples a data index to an absolute size or by a proportional weight using random.choices, which samples with replacement, so the result can contain repeated entries even when downsampling. This resampling is how dataset mixing ratios are controlled in multi-dataset training. select_data_sample selects entries from a data index by slice or by list of indices.
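
The extension-based detection described above can be illustrated with a small standalone sketch. The names guess_builder_name and EXTENSION_TO_BUILDER are hypothetical and are not taken from the module; the mapping is inferred from the extensions and formats listed on this page, and the sketch handles single file paths only, not directories.

```python
import os

# Hypothetical mapping from file extension to builder name, inferred from
# the supported extensions and formats listed on this page.
EXTENSION_TO_BUILDER = {
    "arrow": "arrow",
    "csv": "csv",
    "json": "json",
    "jsonl": "json",
    "parquet": "parquet",
    "txt": "text",
}


def guess_builder_name(path: str) -> str:
    """Return the builder name for a file path, based on its extension."""
    extension = os.path.splitext(path)[-1].lstrip(".").lower()
    if extension not in EXTENSION_TO_BUILDER:
        raise ValueError(f"Unknown dataset file type: {extension!r}")
    return EXTENSION_TO_BUILDER[extension]


print(guess_builder_name("data/train.jsonl"))  # json
print(guess_builder_name("corpus/notes.txt"))  # text
```

Under this assumption, jsonl and txt files map to the json and text builders, which is why the overview lists five formats while the input table lists six extensions.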

Usage

DataLoaderPlugin is invoked by DataEngine when a dataset's configuration specifies source: "local". The adjust_data_index and select_data_sample functions are called by DataEngine for index manipulation during data index building and sample retrieval respectively. To add a new data source, register a new loader with @DataLoaderPlugin("source_name").register().
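
The registration pattern mentioned above can be pictured with a minimal name-based registry. This is only a sketch: MiniPlugin is a hypothetical stand-in written for illustration, not the real BasePlugin, and the "in_memory" source and load_from_memory loader are invented for the example.

```python
from typing import Any, Callable, Dict


class MiniPlugin:
    """Hypothetical stand-in for a plugin base class (not the real BasePlugin)."""

    # Shared name -> function registry for this sketch.
    _registry: Dict[str, Callable[..., Any]] = {}

    def __init__(self, name: str) -> None:
        self.name = name

    def register(self) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Return a decorator that stores the function under this plugin's name."""

        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
            self._registry[self.name] = func
            return func

        return decorator

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        """Dispatch to the function registered under this plugin's name."""
        if self.name not in self._registry:
            raise ValueError(f"No loader registered for source {self.name!r}")
        return self._registry[self.name](*args, **kwargs)


# Hypothetical loader for an invented "in_memory" source.
@MiniPlugin("in_memory").register()
def load_from_memory(filepath: str, split: str, streaming: bool) -> list:
    return [{"text": f"{split} sample from {filepath}"}]


rows = MiniPlugin("in_memory")("unused/path", "train", False)
print(rows)  # [{'text': 'train sample from unused/path'}]
```

The point of the pattern is that callers select a loader by its source name, so a new data source can be added without changing the code that dispatches to it.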

Code Reference

Source Location

Signature

class DataLoaderPlugin(BasePlugin):
    def load(self, dataset_info: DatasetInfo) -> HFDataset: ...

@DataLoaderPlugin("local").register()
def load_data_from_file(filepath: str, split: str, streaming: bool) -> HFDataset: ...

def adjust_data_index(
    data_index: list[tuple[str, int]],
    size: int | None,
    weight: float | None,
) -> list[tuple[str, int]]: ...

def select_data_sample(
    data_index: list[tuple[str, int]],
    index: slice | list[int] | Any,
) -> tuple[str, int] | list[tuple[str, int]]: ...

Import

from llamafactory.v1.plugins.data_plugins.loader import DataLoaderPlugin, load_data_from_file, adjust_data_index, select_data_sample

I/O Contract

Inputs (DataLoaderPlugin.load)

Name | Type | Required | Description
dataset_info | DatasetInfo | Yes | Dictionary containing "path" (file or directory path), optional "split" (default: "train"), and optional "streaming" (default: False).

Inputs (load_data_from_file)

Name | Type | Required | Description
filepath | str | Yes | Path to a local file or directory containing dataset files. Supported extensions: arrow, csv, json, jsonl, parquet, txt.
split | str | Yes | Dataset split to load (e.g., "train").
streaming | bool | Yes | Whether to convert the dataset to an iterable (streaming) dataset.

Inputs (adjust_data_index)

Name | Type | Required | Description
data_index | list[tuple[str, int]] | Yes | List of (dataset_name, sample_index) tuples.
size | int or None | Yes (may be None) | Desired absolute dataset size. Resamples with random.choices (with replacement). Pass None to skip.
weight | float or None | Yes (may be None) | Desired proportional weight multiplier. Resamples to len(data_index) * weight entries. Pass None to skip.
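
As a rough illustration of resampling by size and by weight, the sketch below uses random.choices on two toy indices. resample_index is a hypothetical helper, not the module's adjust_data_index; in particular, applying size first and then weight when both are given, and truncating the weighted length with int(), are assumptions of this sketch.

```python
import random
from typing import List, Optional, Tuple

DataIndex = List[Tuple[str, int]]


def resample_index(
    data_index: DataIndex, size: Optional[int], weight: Optional[float]
) -> DataIndex:
    """Resample with replacement to an absolute size and/or a proportional weight."""
    if size is not None:
        data_index = random.choices(data_index, k=size)
    if weight is not None:
        # Assumption: the target length is truncated to an integer.
        data_index = random.choices(data_index, k=int(len(data_index) * weight))
    return data_index


random.seed(0)  # fixed seed so the sampled entries are repeatable
dataset_a = [("dataset_a", i) for i in range(1000)]
dataset_b = [("dataset_b", i) for i in range(200)]

# Halve the large dataset and double the small one to rebalance the mix.
mixed = resample_index(dataset_a, size=None, weight=0.5) + resample_index(
    dataset_b, size=None, weight=2.0
)
print(len(mixed))  # 900
```

Here the mix goes from 1000:200 to 500:400. Because sampling is with replacement, the 500 entries drawn from dataset_a are not guaranteed to be distinct.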

Inputs (select_data_sample)

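Name | Type | Required | Description
data_index | list[tuple[str, int]] | Yes | List of (dataset_name, sample_index) tuples.
index | slice, list[int], or Any | Yes | A slice object or list of integer indices for sample selection. The signature also admits other index types (Any); their handling is not described on this page.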
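
Slice-based and list-based selection can be sketched as follows. pick_entries is a hypothetical helper, not the module's select_data_sample; rejecting every other index type with a TypeError is a choice made for this sketch only.

```python
from typing import List, Tuple, Union

DataIndex = List[Tuple[str, int]]


def pick_entries(data_index: DataIndex, index: Union[slice, List[int]]) -> DataIndex:
    """Select entries from a data index by slice or by list of positions."""
    if isinstance(index, slice):
        # Python list slicing already clamps out-of-range bounds.
        return data_index[index]
    if isinstance(index, list):
        return [data_index[i] for i in index]
    # Assumption: other index types are rejected in this sketch.
    raise TypeError(f"Unsupported index type: {type(index).__name__}")


data_index = [("dataset_a", i) for i in range(10)]
print(pick_entries(data_index, slice(0, 3)))  # [('dataset_a', 0), ('dataset_a', 1), ('dataset_a', 2)]
print(pick_entries(data_index, [0, 5, 9]))    # [('dataset_a', 0), ('dataset_a', 5), ('dataset_a', 9)]
```

Both branches return a list of (dataset_name, sample_index) tuples, so callers can treat the two selection styles uniformly.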

Outputs

Name | Type | Description
load_data_from_file return | HFDataset | A Hugging Face Dataset (or IterableDataset if streaming=True).
adjust_data_index return | list[tuple[str, int]] | Resampled data index with the desired size or weight.
select_data_sample return | tuple[str, int] or list[tuple[str, int]] | Selected data index entries for the given index.

Usage Examples

from llamafactory.v1.plugins.data_plugins.loader import (
    DataLoaderPlugin, adjust_data_index, select_data_sample
)

# Load a local dataset via plugin
loader = DataLoaderPlugin("local")
dataset = loader.load({"path": "data/train.json", "split": "train", "source": "local"})

# Resample a data index to a specific size
data_index = [("dataset_a", i) for i in range(1000)]
resampled = adjust_data_index(data_index, size=500, weight=None)
print(len(resampled))  # 500

# Resample by weight (double the dataset)
resampled = adjust_data_index(data_index, size=None, weight=2.0)
print(len(resampled))  # 2000

# Select samples by slice
selected = select_data_sample(data_index, slice(0, 10))
print(len(selected))  # 10

# Select samples by index list
selected = select_data_sample(data_index, [0, 5, 10])
print(len(selected))  # 3
