
Implementation:Alibaba ROLL Dataset Info Registry

From Leeroopedia


Knowledge Sources
Domains Data_Management, Training
Last Updated 2026-02-07 20:00 GMT

Overview

A comprehensive JSON configuration registry that maps dataset names to their file locations, formatting rules, column mappings, and source URLs for use with LLaMA-Factory-style training pipelines.

Description

dataset_info.json is a 626-line JSON configuration file that serves as a centralized dataset registry for the mcore_adapter training examples. Each top-level key is a dataset identifier (e.g., "alpaca_en_demo", "dpo_zh_demo", "ultrafeedback"), and its value is a configuration object describing how to locate, load, and parse the dataset.

The registry supports multiple dataset categories:

  • SFT (Supervised Fine-Tuning) datasets: Alpaca-format instruction/output pairs (e.g., identity, alpaca_en_demo, alpaca_zh_demo)
  • DPO/Preference datasets: Ranked preference data with chosen/rejected fields (e.g., dpo_en_demo, dpo_zh_demo, ultrafeedback, orca_pairs)
  • KTO datasets: Binary feedback data with label tags (e.g., kto_en_demo, kto_mix_en)
  • Tool-calling datasets: ShareGPT-format data with tool annotations (e.g., glaive_toolcall_en_demo)
  • Multimodal datasets: Data with image/video columns (e.g., mllm_demo, llava_1k_en)
  • Pretraining datasets: Raw text corpora (e.g., wiki_demo, refinedweb, pile)

Datasets can be loaded from local files (via file_name), Hugging Face Hub (via hf_hub_url), ModelScope Hub (via ms_hub_url), or custom scripts (via script_url).
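As a sketch of how a consumer might resolve these alternatives, the snippet below checks each entry for the first source field present. The sample entries are illustrative, not verbatim registry contents:

```python
# Sketch: determine where a registry entry loads its data from.
# Sample entries below are made up for illustration.
SOURCE_FIELDS = ("file_name", "hf_hub_url", "ms_hub_url", "script_url")

def load_source(config: dict) -> str:
    """Return the first source field present in a dataset config."""
    for field in SOURCE_FIELDS:
        if field in config:
            return field
    raise ValueError("entry defines no load source")

sample = {
    "alpaca_en_demo": {"file_name": "alpaca_en_demo.json"},
    "ultrafeedback": {"hf_hub_url": "llamafactory/ultrafeedback", "ranking": True},
}
for name, cfg in sample.items():
    print(f"{name}: loaded via {load_source(cfg)}")
```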

Usage

This registry is used by the LLaMA-Factory data loading system to resolve dataset names to their locations and parsing rules during training. Use this file to:

  • Register new datasets for training pipelines
  • Configure column mappings for non-standard data formats
  • Specify formatting rules (Alpaca vs. ShareGPT) and tag conventions
  • Add HuggingFace Hub or ModelScope Hub URLs for remote datasets
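For example, registering a new local ShareGPT-format dataset could look like the following sketch. The dataset name, file, and column/tag values are hypothetical, chosen only to match the documented schema:

```python
import json

# Hypothetical entry for a local ShareGPT-format dataset; the name,
# file, and column/tag values are illustrative.
new_entry = {
    "my_sharegpt_data": {
        "file_name": "my_sharegpt_data.json",
        "formatting": "sharegpt",
        "columns": {"messages": "conversations"},
        "tags": {
            "role_tag": "from",
            "content_tag": "value",
            "user_tag": "human",
            "assistant_tag": "gpt",
        },
    }
}

# Merge into the registry and inspect the result.
registry_path = "mcore_adapter/examples/data/dataset_info.json"
try:
    with open(registry_path) as f:
        registry = json.load(f)
except FileNotFoundError:
    registry = {}  # fallback so the sketch runs standalone
registry.update(new_entry)
print(json.dumps(registry["my_sharegpt_data"], indent=2))
```

After merging, the updated registry would be written back with `json.dump(registry, f, indent=2, ensure_ascii=False)`.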

Code Reference

Source Location

  • Repository: Alibaba_ROLL
  • File: mcore_adapter/examples/data/dataset_info.json

Data Schema / Signature

{
  "<dataset_name>": {
    "file_name": "string  -- Local file path (optional)",
    "hf_hub_url": "string  -- HuggingFace Hub dataset ID (optional)",
    "ms_hub_url": "string  -- ModelScope Hub dataset ID (optional)",
    "script_url": "string  -- Custom loading script name (optional)",
    "formatting": "string  -- 'sharegpt' for ShareGPT format (optional, default Alpaca)",
    "subset": "string  -- Dataset subset name (optional)",
    "split": "string  -- Dataset split name, e.g. 'train' or 'validation' (optional)",
    "folder": "string  -- Subfolder within the dataset (optional)",
    "ranking": "boolean  -- True for DPO/preference datasets (optional)",
    "columns": {
      "prompt": "string  -- Column name mapping for prompt/instruction",
      "response": "string  -- Column name mapping for response/output",
      "system": "string  -- Column name mapping for system prompt",
      "messages": "string  -- Column name for ShareGPT message lists",
      "chosen": "string  -- Column name for preferred response (DPO)",
      "rejected": "string  -- Column name for rejected response (DPO)",
      "kto_tag": "string  -- Column name for KTO binary label",
      "tools": "string  -- Column name for tool definitions",
      "images": "string  -- Column name for image paths",
      "videos": "string  -- Column name for video paths",
      "history": "string  -- Column name for conversation history"
    },
    "tags": {
      "role_tag": "string  -- Key for speaker role in message objects",
      "content_tag": "string  -- Key for message content",
      "user_tag": "string  -- Value representing user role",
      "assistant_tag": "string  -- Value representing assistant role"
    }
  }
}
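A concrete entry instantiating this schema might look as follows; the dataset ID and column names here are made up for illustration, not taken from the registry:

```python
# Hypothetical registry entry: a DPO/preference dataset hosted on the
# HuggingFace Hub whose columns need remapping to the standard names.
example_entry = {
    "my_preference_data": {
        "hf_hub_url": "my-org/my-preference-data",
        "ranking": True,
        "columns": {
            "prompt": "question",
            "chosen": "response_good",
            "rejected": "response_bad",
        },
    }
}
print(example_entry)
```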

I/O Contract

Inputs

  • dataset_name (key): string, required. Unique identifier used to reference the dataset in training configs.
  • file_name: string, optional*. Local filename relative to the data directory.
  • hf_hub_url: string, optional*. HuggingFace Hub dataset identifier (e.g., "llamafactory/alpaca_en").
  • ms_hub_url: string, optional*. ModelScope Hub dataset identifier.
  • script_url: string, optional*. Name of a custom loading script.
  • formatting: string, optional. Data format; omit for Alpaca, set to "sharegpt" for ShareGPT conversation format.
  • ranking: boolean, optional. Set to true for DPO/preference datasets with chosen/rejected pairs.
  • columns: object, optional. Column-name remapping for non-standard field names.
  • tags: object, optional. Role/content tag remapping for ShareGPT-format conversations.

*At least one of file_name, hf_hub_url, ms_hub_url, or script_url is required.
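These constraints can be checked mechanically. The validator below is a sketch based strictly on the contract as documented here (it is not part of the registry or of LLaMA-Factory):

```python
# Sketch: validate one registry entry against the documented contract.
SOURCE_FIELDS = {"file_name", "hf_hub_url", "ms_hub_url", "script_url"}

def validate_entry(name: str, config: dict) -> list:
    """Return a list of constraint violations for one registry entry."""
    errors = []
    if not SOURCE_FIELDS & config.keys():
        errors.append(f"{name}: needs one of {sorted(SOURCE_FIELDS)}")
    if "formatting" in config and config["formatting"] != "sharegpt":
        # Per the contract above, formatting is omitted for Alpaca data.
        errors.append(f"{name}: unexpected formatting {config['formatting']!r}")
    if "ranking" in config and not isinstance(config["ranking"], bool):
        errors.append(f"{name}: ranking must be a boolean")
    for field in ("columns", "tags"):
        if field in config and not isinstance(config[field], dict):
            errors.append(f"{name}: {field} must be an object")
    return errors

print(validate_entry("bad", {"ranking": "yes"}))
```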

Outputs

  • LLaMA-Factory data loader: Resolves each dataset name to its location and parses the data using the specified formatting and column mappings.
  • Training pipeline: Receives properly formatted training samples (instruction/response pairs, preference pairs, or conversation turns).

Usage Examples

import json

# Load the dataset registry
with open("mcore_adapter/examples/data/dataset_info.json", "r") as f:
    registry = json.load(f)

# List all registered datasets
print(f"Total datasets registered: {len(registry)}")

# Find all DPO/preference datasets
dpo_datasets = {k: v for k, v in registry.items() if v.get("ranking", False)}
print(f"DPO/preference datasets: {list(dpo_datasets.keys())}")
# Output: ['dpo_en_demo', 'dpo_zh_demo', 'dpo_mix_en', 'dpo_mix_zh',
#          'ultrafeedback', 'rlhf_v', 'vlfeedback', 'orca_pairs',
#          'hh_rlhf_en', 'nectar_rm', 'orca_dpo_de']

# Get configuration for a specific dataset
dpo_zh_config = registry["dpo_zh_demo"]
print(f"File: {dpo_zh_config['file_name']}")
print(f"Format: {dpo_zh_config['formatting']}")
print(f"Columns: {dpo_zh_config['columns']}")
