Implementation:Hiyouga LLaMA Factory Data Parser

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Data Processing, Configuration
Last Updated	2026-02-06 19:00 GMT

Overview

Parses dataset configuration metadata and resolves dataset attributes from a JSON configuration file into structured DatasetAttr objects.

Description

This module defines the DatasetAttr dataclass, which holds all configuration attributes for a dataset including its load source, formatting style, column mappings, and ShareGPT tag mappings. The get_dataset_list function reads a dataset_info.json configuration file and resolves each named dataset to a DatasetAttr instance with the correct load source (HuggingFace Hub, ModelScope, OpenMind, script, cloud file, or local file). Column and tag overrides from the configuration are applied via the join method, decoupling dataset specification from the actual loading logic.
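The override step described above can be sketched with a small stand-in class. This is a simplified illustration under stated assumptions, not the library's code: `SketchAttr`, its reduced field set, the default column names, and the nested `columns` key are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchAttr:
    """Hypothetical, reduced stand-in for DatasetAttr (illustration only)."""

    load_from: str
    dataset_name: str
    formatting: str = "alpaca"
    ranking: bool = False
    prompt: Optional[str] = "instruction"  # assumed default column name
    response: Optional[str] = "output"  # assumed default column name

    def set_attr(self, key: str, obj: dict, default: Any = None) -> None:
        # Copy obj[key] onto the instance, falling back to the default.
        setattr(self, key, obj.get(key, default))

    def join(self, attr: dict) -> None:
        # Top-level options in the entry override the dataclass defaults.
        self.set_attr("formatting", attr, default="alpaca")
        self.set_attr("ranking", attr, default=False)
        # Column overrides sit in a nested mapping (assumed key: "columns").
        columns = attr.get("columns", {})
        for name in ("prompt", "response"):
            if name in columns:
                self.set_attr(name, columns)


# A made-up configuration entry that overrides the format and one column.
entry = {"formatting": "sharegpt", "columns": {"prompt": "question"}}
attr = SketchAttr(load_from="file", dataset_name="demo.json")
attr.join(entry)
print(attr.formatting, attr.prompt, attr.response)
```

Keeping the overrides in one method means the loader only ever sees a fully populated attribute object and never has to read the raw configuration itself.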

Usage

Use this module when you need to resolve a list of dataset names into their full configuration attributes prior to loading. It is called during the data preparation phase of training and evaluation workflows, translating user-specified dataset names (from CLI or YAML) into concrete loading instructions.
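As a rough sketch of how a dataset name might be translated into a load source, the function below picks the source from whichever key is present in the dataset's configuration entry. The key names (`hf_hub_url`, `file_name`, and so on), their priority order, and the helper `resolve_source` are assumptions made for illustration; the real module may choose between hubs differently.

```python
# Assumed mapping from configuration key to load source, in priority order.
SOURCE_KEYS = [
    ("hf_hub_url", "hf_hub"),
    ("ms_hub_url", "ms_hub"),
    ("om_hub_url", "om_hub"),
    ("script_url", "script"),
    ("cloud_file_name", "cloud_file"),
    ("file_name", "file"),
]


def resolve_source(name: str, dataset_info: dict) -> tuple:
    """Return (load_from, dataset_name) for one configured dataset (hypothetical helper)."""
    if name not in dataset_info:
        raise ValueError(f"Undefined dataset {name!r} in the configuration.")
    entry = dataset_info[name]
    for key, source in SOURCE_KEYS:
        if key in entry:
            return source, entry[key]
    raise ValueError(f"Dataset {name!r} does not declare a load source.")


# Made-up configuration with one local file and one hub dataset.
dataset_info = {
    "local_demo": {"file_name": "local_demo.json"},
    "hub_demo": {"hf_hub_url": "tatsu-lab/alpaca"},
}
for name in ("local_demo", "hub_demo"):
    print(name, resolve_source(name, dataset_info))
```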

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/data/parser.py
Lines: 1-149

Signature

@dataclass
class DatasetAttr:
    load_from: Literal["hf_hub", "ms_hub", "om_hub", "script", "file"]
    dataset_name: str
    formatting: Literal["alpaca", "sharegpt", "openai"] = "alpaca"
    ranking: bool = False
    ...
    def set_attr(self, key: str, obj: dict[str, Any], default: Any | None = None) -> None
    def join(self, attr: dict[str, Any]) -> None

def get_dataset_list(dataset_names: list[str] | None, dataset_dir: str | dict) -> list["DatasetAttr"]

Import

from llamafactory.data.parser import DatasetAttr, get_dataset_list

I/O Contract

Inputs

Name	Type	Required	Description
dataset_names	list[str] | None	Yes	List of dataset names to resolve from the configuration file; the signature also accepts None
dataset_dir	str | dict	Yes	Path to the directory containing dataset_info.json, the string "ONLINE" for hub-only mode, "REMOTE:<repo_id>" for remote config, or a pre-loaded dict
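The four accepted forms of dataset_dir can be told apart with a few checks. The sketch below only classifies the argument and does not load anything; `classify_dataset_dir` and the mode labels it returns are hypothetical names, not part of the library.

```python
import os
from typing import Optional, Union

CONFIG_NAME = "dataset_info.json"


def classify_dataset_dir(dataset_dir: Union[str, dict]) -> tuple:
    """Return (mode, detail) for a dataset_dir value (hypothetical helper)."""
    if isinstance(dataset_dir, dict):
        # Configuration was already loaded by the caller.
        return ("preloaded", None)
    if dataset_dir == "ONLINE":
        # Hub-only mode: no configuration file is read.
        return ("online", None)
    if dataset_dir.startswith("REMOTE:"):
        # Everything after the prefix is treated as the repository id.
        return ("remote", dataset_dir[len("REMOTE:"):])
    # Otherwise treat the value as a local directory holding the config file.
    return ("local", os.path.join(dataset_dir, CONFIG_NAME))


for value in ("data", "ONLINE", "REMOTE:org/repo", {"demo": {}}):
    print(classify_dataset_dir(value))
```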

Outputs

Name	Type	Description
dataset_list	`list[DatasetAttr]`	A list of fully resolved dataset attribute objects, each specifying the load source, formatting, column mappings, and tag mappings

Usage Examples

# Resolve datasets from a local config directory
from llamafactory.data.parser import get_dataset_list

dataset_list = get_dataset_list(
    dataset_names=["alpaca_en", "alpaca_zh"],
    dataset_dir="data"
)
for ds_attr in dataset_list:
    print(f"{ds_attr.dataset_name}: load_from={ds_attr.load_from}, format={ds_attr.formatting}")

# Use ONLINE mode to load datasets directly from HuggingFace Hub
dataset_list = get_dataset_list(
    dataset_names=["tatsu-lab/alpaca"],
    dataset_dir="ONLINE"
)

Related Pages

Hiyouga_LLaMA_Factory_Data_Args - Defines the DataArguments that specify dataset names and directory passed to get_dataset_list
Hiyouga_LLaMA_Factory_Supervised_Processor - Consumes DatasetAttr objects during supervised fine-tuning data preparation
Hiyouga_LLaMA_Factory_Feedback_Processor - Consumes DatasetAttr objects during KTO data preparation
Hiyouga_LLaMA_Factory_Pairwise_Processor - Consumes DatasetAttr objects during pairwise preference data preparation
