Implementation:Hiyouga LLaMA Factory Data Parser
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Configuration |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Parses dataset configuration metadata and resolves dataset attributes from a JSON configuration file into structured DatasetAttr objects.
Description
This module defines the DatasetAttr dataclass, which holds all configuration attributes for a dataset including its load source, formatting style, column mappings, and ShareGPT tag mappings. The get_dataset_list function reads a dataset_info.json configuration file and resolves each named dataset to a DatasetAttr instance with the correct load source (HuggingFace Hub, ModelScope, OpenMind, script, cloud file, or local file). Column and tag overrides from the configuration are applied via the join method, decoupling dataset specification from the actual loading logic.
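The override step described above can be pictured with a minimal sketch. This is not the library's implementation: `SketchAttr` is a hypothetical, reduced stand-in for DatasetAttr, the default column names are assumptions, and the shape of the `entry` dict (a `file_name` key plus a `columns` mapping) is assumed for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchAttr:
    """Hypothetical, reduced stand-in for DatasetAttr (not the library class)."""

    load_from: str
    dataset_name: str
    formatting: str = "alpaca"
    prompt: Optional[str] = "instruction"  # assumed default column name
    response: Optional[str] = "output"  # assumed default column name

    def set_attr(self, key: str, obj: dict[str, Any], default: Any = None) -> None:
        # Copy obj[key] onto the attribute of the same name, else use `default`.
        setattr(self, key, obj.get(key, default))

    def join(self, attr: dict[str, Any]) -> None:
        # Apply top-level options first, then per-column overrides if present.
        self.set_attr("formatting", attr, default="alpaca")
        if "columns" in attr:
            for column in ("prompt", "response"):
                self.set_attr(column, attr["columns"], default=getattr(self, column))


# Assumed shape of a single configuration entry.
entry = {"file_name": "my_data.json", "columns": {"prompt": "question"}}
attr = SketchAttr(load_from="file", dataset_name=entry["file_name"])
attr.join(entry)
print(attr.prompt, attr.response)
```

Only the `prompt` column is overridden here; `response` keeps its default because the entry does not mention it, which is the decoupling the description refers to.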
Usage
Use this module when you need to resolve a list of dataset names into their full configuration attributes prior to loading. It is called during the data preparation phase of training and evaluation workflows, translating user-specified dataset names (from CLI or YAML) into concrete loading instructions.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/data/parser.py
- Lines: 1-149
Signature
@dataclass
class DatasetAttr:
    load_from: Literal["hf_hub", "ms_hub", "om_hub", "script", "file"]
    dataset_name: str
    formatting: Literal["alpaca", "sharegpt", "openai"] = "alpaca"
    ranking: bool = False
    ...
    def set_attr(self, key: str, obj: dict[str, Any], default: Any | None = None) -> None
    def join(self, attr: dict[str, Any]) -> None

def get_dataset_list(dataset_names: list[str] | None, dataset_dir: str | dict) -> list["DatasetAttr"]
Import
from llamafactory.data.parser import DatasetAttr, get_dataset_list
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_names | list[str] or None | Yes | List of dataset names to resolve from the configuration file |
| dataset_dir | str or dict | Yes | Path to the directory containing dataset_info.json, the string "ONLINE" for hub-only mode, "REMOTE:<repo_id>" for remote config, or a pre-loaded dict |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset_list | list[DatasetAttr] | A list of fully resolved dataset attribute objects, each specifying the load source, formatting, column mappings, and tag mappings |
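The four accepted forms of dataset_dir listed in the inputs table can be told apart with a few checks. The sketch below is illustrative only: `classify_dataset_dir` is a hypothetical helper, not part of llamafactory, and it merely labels the input rather than loading any configuration.

```python
from typing import Union


def classify_dataset_dir(dataset_dir: Union[str, dict]) -> tuple[str, str]:
    """Hypothetical helper: return (kind, target) for a dataset_dir value."""
    if isinstance(dataset_dir, dict):
        # A pre-loaded configuration dict needs no further lookup.
        return ("preloaded", "")
    if dataset_dir == "ONLINE":
        # Hub-only mode: no dataset_info.json is consulted.
        return ("online", "")
    if dataset_dir.startswith("REMOTE:"):
        # Everything after the prefix is treated as the repository id.
        return ("remote", dataset_dir[len("REMOTE:"):])
    # Anything else is taken to be a local directory path.
    return ("local", dataset_dir)


print(classify_dataset_dir("REMOTE:org/repo"))
print(classify_dataset_dir("data"))
```

Checking for a dict first keeps the string comparisons safe, since the remaining branches assume a str.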
Usage Examples
# Resolve datasets from a local config directory
from llamafactory.data.parser import get_dataset_list

dataset_list = get_dataset_list(
    dataset_names=["alpaca_en", "alpaca_zh"],
    dataset_dir="data"
)
for ds_attr in dataset_list:
    print(f"{ds_attr.dataset_name}: load_from={ds_attr.load_from}, format={ds_attr.formatting}")

# Use ONLINE mode to load datasets directly from HuggingFace Hub
dataset_list = get_dataset_list(
    dataset_names=["tatsu-lab/alpaca"],
    dataset_dir="ONLINE"
)
Related Pages
- Hiyouga_LLaMA_Factory_Data_Args - Defines the DataArguments that specify dataset names and directory passed to get_dataset_list
- Hiyouga_LLaMA_Factory_Supervised_Processor - Consumes DatasetAttr objects during supervised fine-tuning data preparation
- Hiyouga_LLaMA_Factory_Feedback_Processor - Consumes DatasetAttr objects during KTO data preparation
- Hiyouga_LLaMA_Factory_Pairwise_Processor - Consumes DatasetAttr objects during pairwise preference data preparation