Overview
DataLoaderPlugin provides local dataset loading from files in various formats (arrow, csv, json, parquet, text) along with data index resampling and selection utilities.
Description
The data loader module defines DataLoaderPlugin (extending BasePlugin), registers load_data_from_file as its "local" loader, and provides two standalone utility functions. load_data_from_file detects the file type from the extension and calls HuggingFace's load_dataset with the matching builder name; it accepts either a single file or a directory, and can convert the resulting map-style dataset to an iterable dataset for streaming. adjust_data_index resamples a data index to an absolute size or by a proportional weight using random.choices, which is how dataset mixing ratios are controlled in multi-dataset training. Because random.choices samples with replacement, the result can contain duplicate entries even when the index shrinks. select_data_sample selects index entries by slice or by a list of positions.
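The extension-based detection described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name guess_builder_name and the EXT_TO_BUILDER table are hypothetical, and the mapping of jsonl to the json builder and txt to the text builder is an assumption based on the supported extensions listed on this page.

```python
import os

# Assumed mapping from file extension to HuggingFace builder name.
# The table and the helper below are illustrative, not part of the module.
EXT_TO_BUILDER = {
    "arrow": "arrow",
    "csv": "csv",
    "json": "json",
    "jsonl": "json",
    "parquet": "parquet",
    "txt": "text",
}


def guess_builder_name(path: str) -> str:
    """Return the builder name implied by a file's extension (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in EXT_TO_BUILDER:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return EXT_TO_BUILDER[ext]


print(guess_builder_name("data/train.jsonl"))   # json
print(guess_builder_name("corpus/part-0.txt"))  # text
```

The returned name is what would be passed as the first argument to load_dataset, with the path supplied separately.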
Usage
DataEngine invokes DataLoaderPlugin when a dataset's configuration specifies source: "local". It also calls adjust_data_index while building the data index and select_data_sample when retrieving samples. To add a new data source, decorate the loader function with @DataLoaderPlugin("source_name").register().
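To make the registration pattern concrete, here is a toy name-keyed registry that supports the same decorator shape. It is only a sketch: SketchPlugin, its _registry attribute, and the "in_memory" source are hypothetical and do not reflect how BasePlugin is actually implemented. The decorator expression requires Python 3.9 or later.

```python
from typing import Callable


class SketchPlugin:
    """Toy registry keyed by source name; a stand-in, not the real BasePlugin."""

    _registry: dict[str, Callable] = {}

    def __init__(self, name: str) -> None:
        self.name = name

    def register(self) -> Callable[[Callable], Callable]:
        """Return a decorator that stores the function under this plugin's name."""

        def decorator(func: Callable) -> Callable:
            SketchPlugin._registry[self.name] = func
            return func  # leave the function usable on its own

        return decorator

    def __call__(self, *args, **kwargs):
        """Dispatch to whichever function was registered under this name."""
        if self.name not in SketchPlugin._registry:
            raise ValueError(f"No loader registered for {self.name!r}")
        return SketchPlugin._registry[self.name](*args, **kwargs)


# Register a hypothetical "in_memory" source.
@SketchPlugin("in_memory").register()
def load_in_memory(rows: list[dict]) -> list[dict]:
    return list(rows)


print(SketchPlugin("in_memory")([{"text": "hello"}]))  # [{'text': 'hello'}]
```

Keying the registry by name lets the caller pick a loader from configuration alone, which matches the source: "local" lookup described above.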
Code Reference
Source Location
Signature
class DataLoaderPlugin(BasePlugin):
    def load(self, dataset_info: DatasetInfo) -> HFDataset: ...

@DataLoaderPlugin("local").register()
def load_data_from_file(filepath: str, split: str, streaming: bool) -> HFDataset: ...

def adjust_data_index(
    data_index: list[tuple[str, int]],
    size: int | None,
    weight: float | None,
) -> list[tuple[str, int]]: ...

def select_data_sample(
    data_index: list[tuple[str, int]],
    index: slice | list[int] | Any,
) -> tuple[str, int] | list[tuple[str, int]]: ...
Import
from llamafactory.v1.plugins.data_plugins.loader import DataLoaderPlugin, load_data_from_file, adjust_data_index, select_data_sample
I/O Contract
Inputs (DataLoaderPlugin.load)
| Name | Type | Required | Description |
|------|------|----------|-------------|
| dataset_info | DatasetInfo | Yes | Dictionary containing "path" (file or directory path), optional "split" (default: "train"), and optional "streaming" (default: False). |
Inputs (load_data_from_file)
| Name | Type | Required | Description |
|------|------|----------|-------------|
| filepath | str | Yes | Path to a local file or directory containing dataset files. Supported extensions: arrow, csv, json, jsonl, parquet, txt. |
| split | str | Yes | Dataset split to load (e.g., "train"). |
| streaming | bool | Yes | Whether to convert the dataset to an iterable (streaming) dataset. |
Inputs (adjust_data_index)
| Name | Type | Required | Description |
|------|------|----------|-------------|
| data_index | list[tuple[str, int]] | Yes | List of (dataset_name, sample_index) tuples. |
| size | int or None | No (may be None) | Desired absolute dataset size. Resamples to this many entries using random.choices; None skips size-based resampling. |
| weight | float or None | No (may be None) | Desired proportional weight multiplier. Resamples to len(data_index) * weight entries; None skips weight-based resampling. |
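The resampling behaviour in the table above can be sketched with the standard library alone. The function resample_index is a hypothetical stand-in for adjust_data_index; applying size before weight and truncating the weighted length with int() are assumptions, since this page does not state either detail.

```python
import random


def resample_index(data_index, size=None, weight=None):
    """Resample an index by absolute size and/or weight (illustrative stand-in)."""
    if size is not None:
        # random.choices draws with replacement, so entries may repeat.
        data_index = random.choices(data_index, k=size)
    if weight is not None:
        # Assumption: the target length is truncated toward zero.
        data_index = random.choices(data_index, k=int(len(data_index) * weight))
    return data_index


random.seed(0)  # seed only to make the sketch repeatable
index = [("dataset_a", i) for i in range(1000)]

smaller = resample_index(index, size=500)
print(len(smaller))              # 500
print(len(set(smaller)) <= 500)  # True: duplicates are possible

doubled = resample_index(index, weight=2.0)
print(len(doubled))              # 2000
```

Because sampling is with replacement, a resampled index is not a subset in the strict sense: some entries can appear several times while others are dropped.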
Inputs (select_data_sample)
| Name | Type | Required | Description |
|------|------|----------|-------------|
| data_index | list[tuple[str, int]] | Yes | List of (dataset_name, sample_index) tuples. |
| index | slice or list[int] | Yes | A slice object or list of integer indices for sample selection. |
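The two selection modes can be illustrated with a short sketch. The function pick_samples is a hypothetical stand-in for select_data_sample; how the real function treats index types other than slice and list is not stated on this page, so rejecting them with ValueError here is an assumption.

```python
def pick_samples(data_index, index):
    """Select index entries by slice or by list of positions (illustrative stand-in)."""
    if isinstance(index, slice):
        return data_index[index]
    if isinstance(index, list):
        return [data_index[i] for i in index]
    # Assumption: any other index type is rejected.
    raise ValueError(f"Invalid index type {type(index).__name__}.")


index = [("dataset_a", i) for i in range(10)]

print(pick_samples(index, slice(0, 3)))
# [('dataset_a', 0), ('dataset_a', 1), ('dataset_a', 2)]

print(pick_samples(index, [0, 5, 9]))
# [('dataset_a', 0), ('dataset_a', 5), ('dataset_a', 9)]
```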
Outputs
| Name | Type | Description |
|------|------|-------------|
| load_data_from_file return | HFDataset | A HuggingFace Dataset (or IterableDataset if streaming=True). |
| adjust_data_index return | list[tuple[str, int]] | Resampled data index with the desired size or weight. |
| select_data_sample return | tuple[str, int] or list[tuple[str, int]] | Selected data index entries for the given index. |
Usage Examples
from llamafactory.v1.plugins.data_plugins.loader import (
    DataLoaderPlugin, adjust_data_index, select_data_sample
)

# Load a local dataset via plugin
loader = DataLoaderPlugin("local")
dataset = loader.load({"path": "data/train.json", "split": "train", "source": "local"})

# Resample a data index to a specific size
data_index = [("dataset_a", i) for i in range(1000)]
resampled = adjust_data_index(data_index, size=500, weight=None)
print(len(resampled))  # 500

# Resample by weight (double the dataset)
resampled = adjust_data_index(data_index, size=None, weight=2.0)
print(len(resampled))  # 2000

# Select samples by slice
selected = select_data_sample(data_index, slice(0, 10))
print(len(selected))  # 10

# Select samples by index list
selected = select_data_sample(data_index, [0, 5, 10])
print(len(selected))  # 3
Related Pages