Implementation:Hiyouga LLaMA Factory Data Parser

From Leeroopedia


Knowledge Sources
Domains: Data Processing, Configuration
Last Updated: 2026-02-06 19:00 GMT

Overview

Parses dataset metadata from a JSON configuration file (dataset_info.json) and resolves each named dataset into a structured DatasetAttr object.

Description

This module defines the DatasetAttr dataclass, which holds all configuration attributes for a dataset including its load source, formatting style, column mappings, and ShareGPT tag mappings. The get_dataset_list function reads a dataset_info.json configuration file and resolves each named dataset to a DatasetAttr instance with the correct load source (HuggingFace Hub, ModelScope, OpenMind, script, cloud file, or local file). Column and tag overrides from the configuration are applied via the join method, decoupling dataset specification from the actual loading logic.
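The load-source resolution described above can be sketched as follows. This is a simplified, hypothetical stand-in rather than the module's own code: the function name `resolve_load_source`, the entry keys (`hf_hub_url`, `ms_hub_url`, `om_hub_url`, `script_url`, `cloud_file_name`, `file_name`), the precedence order, and the sample entries are all assumptions made for illustration.

```python
from typing import Any

# Assumed mapping from a config key to the load source it implies.
# Key names and their order are illustrative, not taken from the module.
_SOURCE_KEYS = [
    ("hf_hub_url", "hf_hub"),
    ("ms_hub_url", "ms_hub"),
    ("om_hub_url", "om_hub"),
    ("script_url", "script"),
    ("cloud_file_name", "cloud_file"),
    ("file_name", "file"),
]


def resolve_load_source(entry: dict[str, Any]) -> tuple[str, str]:
    """Return (load_from, dataset_name) for one config entry."""
    for key, load_from in _SOURCE_KEYS:
        if key in entry:
            return load_from, entry[key]
    raise ValueError(f"No known source key in entry: {sorted(entry)}")


# Hypothetical config content, shaped like a pre-loaded dataset_info dict.
dataset_info = {
    "alpaca_en": {"file_name": "alpaca_en.json"},
    "alpaca_hub": {"hf_hub_url": "tatsu-lab/alpaca"},
}
resolved = {name: resolve_load_source(entry) for name, entry in dataset_info.items()}
print(resolved)
```

Keeping the key-to-source mapping in one table makes the precedence explicit; the real module may order or name these keys differently.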

Usage

Use this module when you need to resolve a list of dataset names into their full configuration attributes prior to loading. It is called during the data preparation phase of training and evaluation workflows, translating user-specified dataset names (from CLI or YAML) into concrete loading instructions.

Code Reference

Source Location

Signature

@dataclass
class DatasetAttr:
    load_from: Literal["hf_hub", "ms_hub", "om_hub", "script", "file"]
    dataset_name: str
    formatting: Literal["alpaca", "sharegpt", "openai"] = "alpaca"
    ranking: bool = False
    ...
    def set_attr(self, key: str, obj: dict[str, Any], default: Any | None = None) -> None
    def join(self, attr: dict[str, Any]) -> None

def get_dataset_list(dataset_names: list[str] | None, dataset_dir: str | dict) -> list["DatasetAttr"]
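To illustrate how set_attr and join could cooperate to apply overrides, here is a minimal sketch. `MiniDatasetAttr` is a hypothetical, trimmed-down stand-in for DatasetAttr; the `prompt` and `response` fields, their defaults, and the nested `"columns"` key are assumptions, and the real method bodies may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MiniDatasetAttr:
    """Hypothetical, trimmed-down stand-in for DatasetAttr."""

    load_from: str
    dataset_name: str
    formatting: str = "alpaca"
    ranking: bool = False
    # Assumed column-mapping fields, for illustration only.
    prompt: Optional[str] = "instruction"
    response: Optional[str] = "output"

    def set_attr(self, key: str, obj: dict[str, Any], default: Optional[Any] = None) -> None:
        # Copy obj[key] onto the instance, falling back to `default` when absent.
        setattr(self, key, obj.get(key, default))

    def join(self, attr: dict[str, Any]) -> None:
        # Top-level options fall back to their defaults when missing.
        self.set_attr("formatting", attr, default="alpaca")
        self.set_attr("ranking", attr, default=False)
        # Column overrides live in an assumed nested "columns" mapping;
        # in this sketch only keys that are present replace the current value.
        columns = attr.get("columns", {})
        for name in ("prompt", "response"):
            if name in columns:
                self.set_attr(name, columns)


attr = MiniDatasetAttr("file", dataset_name="my_data.json")
attr.join({"formatting": "sharegpt", "columns": {"prompt": "question"}})
print(attr)
```

Routing every override through set_attr keeps the fallback-to-default rule in one place, so join only has to list which keys it cares about.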

Import

from llamafactory.data.parser import DatasetAttr, get_dataset_list

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| dataset_names | list[str] or None | Yes | List of dataset names to resolve from the configuration file |
| dataset_dir | str or dict | Yes | Path to the directory containing dataset_info.json, the string "ONLINE" for hub-only mode, "REMOTE:<repo_id>" for remote config, or a pre-loaded dict |
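The four accepted forms of dataset_dir can be told apart with a few checks. The helper below is a hypothetical sketch (`describe_dataset_dir` and its return labels are not part of the module); it only classifies the value and does not read or download anything.

```python
import os
from typing import Union

DATA_CONFIG = "dataset_info.json"  # config file name mentioned in the description


def describe_dataset_dir(dataset_dir: Union[str, dict]) -> str:
    """Classify a dataset_dir value into one of the four documented forms."""
    if isinstance(dataset_dir, dict):
        # Already-parsed configuration: nothing to locate.
        return "preloaded"
    if dataset_dir == "ONLINE":
        # Hub-only mode: no configuration file is consulted.
        return "online"
    if dataset_dir.startswith("REMOTE:"):
        # Everything after the prefix is treated as the repository id.
        repo_id = dataset_dir[len("REMOTE:"):]
        return f"remote:{repo_id}"
    # Otherwise treat the value as a local directory holding the config file.
    return "local:" + os.path.join(dataset_dir, DATA_CONFIG)


for value in ({"my_ds": {}}, "ONLINE", "REMOTE:org/repo", "data"):
    print(describe_dataset_dir(value))
```

The dict check comes first because string methods such as startswith are not available on a dict.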

Outputs

| Name | Type | Description |
|---|---|---|
| dataset_list | list[DatasetAttr] | A list of fully resolved dataset attribute objects, each specifying the load source, formatting, column mappings, and tag mappings |

Usage Examples

# Resolve datasets from a local config directory
from llamafactory.data.parser import get_dataset_list

dataset_list = get_dataset_list(
    dataset_names=["alpaca_en", "alpaca_zh"],
    dataset_dir="data"
)
for ds_attr in dataset_list:
    print(f"{ds_attr.dataset_name}: load_from={ds_attr.load_from}, format={ds_attr.formatting}")

# Use ONLINE mode to load datasets directly from HuggingFace Hub
dataset_list = get_dataset_list(
    dataset_names=["tatsu-lab/alpaca"],
    dataset_dir="ONLINE"
)
