Implementation:Alibaba ROLL Dataset Info Registry
| Knowledge Sources | |
|---|---|
| Domains | Data_Management, Training |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A comprehensive JSON configuration registry that maps dataset names to their file locations, formatting rules, column mappings, and source URLs for use with LLaMA-Factory-style training pipelines.
Description
dataset_info.json is a 626-line JSON configuration file that serves as a centralized dataset registry for the mcore_adapter training examples. Each top-level key is a dataset identifier (e.g., "alpaca_en_demo", "dpo_zh_demo", "ultrafeedback"), and its value is a configuration object describing how to locate, load, and parse the dataset.
The registry supports multiple dataset categories:
- SFT (Supervised Fine-Tuning) datasets: Alpaca-format instruction/output pairs (e.g., identity, alpaca_en_demo, alpaca_zh_demo)
- DPO/Preference datasets: Ranked preference data with chosen/rejected fields (e.g., dpo_en_demo, dpo_zh_demo, ultrafeedback, orca_pairs)
- KTO datasets: Binary feedback data with label tags (e.g., kto_en_demo, kto_mix_en)
- Tool-calling datasets: ShareGPT-format data with tool annotations (e.g., glaive_toolcall_en_demo)
- Multimodal datasets: Data with image/video columns (e.g., mllm_demo, llava_1k_en)
- Pretraining datasets: Raw text corpora (e.g., wiki_demo, refinedweb, pile)
Datasets can be loaded from local files (via file_name), Hugging Face Hub (via hf_hub_url), ModelScope Hub (via ms_hub_url), or custom scripts (via script_url).
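For concreteness, here is a minimal sketch of two representative entries: a local Alpaca-format SFT dataset and a DPO dataset pulled from the Hugging Face Hub. The dataset names, paths, and column names below are hypothetical illustrations, not entries copied from the actual file:

```python
import json

# Hypothetical entries illustrating two common shapes; the names and
# paths below are invented for illustration, not from the real file.
registry_sketch = {
    # SFT dataset in Alpaca format, loaded from a local file.
    # Omitting "formatting" selects the default Alpaca parser.
    "my_sft_demo": {
        "file_name": "my_sft_demo.json",
    },
    # Preference dataset from the Hugging Face Hub; "ranking" marks it
    # as DPO-style, and "columns" remaps non-standard field names.
    "my_dpo_demo": {
        "hf_hub_url": "org/my-dpo-dataset",
        "ranking": True,
        "columns": {
            "prompt": "question",
            "chosen": "preferred",
            "rejected": "dispreferred",
        },
    },
}

print(json.dumps(registry_sketch, indent=2))
```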
Usage
This registry is used by the LLaMA-Factory data loading system to resolve dataset names to their locations and parsing rules during training. Use this file to:
- Register new datasets for training pipelines (a registration sketch follows this list)
- Configure column mappings for non-standard data formats
- Specify formatting rules (Alpaca vs. ShareGPT) and tag conventions
- Add HuggingFace Hub or ModelScope Hub URLs for remote datasets
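As a sketch of the first workflow, the snippet below registers a hypothetical local ShareGPT-format DPO dataset and writes the registry back to disk. The entry name and file are invented for illustration:

```python
import json

REGISTRY_PATH = "mcore_adapter/examples/data/dataset_info.json"

# Hypothetical entry: a local ShareGPT-format preference dataset.
new_entry = {
    "my_dpo_sharegpt": {
        "file_name": "my_dpo_sharegpt.json",  # relative to the data directory
        "formatting": "sharegpt",
        "ranking": True,
        "columns": {
            "messages": "conversations",
            "chosen": "chosen",
            "rejected": "rejected",
        },
    }
}

with open(REGISTRY_PATH, "r", encoding="utf-8") as f:
    registry = json.load(f)

registry.update(new_entry)  # dataset names must stay unique

with open(REGISTRY_PATH, "w", encoding="utf-8") as f:
    json.dump(registry, f, indent=2, ensure_ascii=False)
```

Once registered, the dataset can be referenced by its key (here, my_dpo_sharegpt) in training configs.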
Code Reference
Source Location
- Repository: Alibaba_ROLL
- File: mcore_adapter/examples/data/dataset_info.json
Data Schema / Signature
```json
{
  "<dataset_name>": {
    "file_name": "string -- Local file path (optional)",
    "hf_hub_url": "string -- HuggingFace Hub dataset ID (optional)",
    "ms_hub_url": "string -- ModelScope Hub dataset ID (optional)",
    "script_url": "string -- Custom loading script name (optional)",
    "formatting": "string -- 'sharegpt' for ShareGPT format (optional, default Alpaca)",
    "subset": "string -- Dataset subset name (optional)",
    "split": "string -- Dataset split name, e.g. 'train' or 'validation' (optional)",
    "folder": "string -- Subfolder within the dataset (optional)",
    "ranking": "boolean -- True for DPO/preference datasets (optional)",
    "columns": {
      "prompt": "string -- Column name mapping for prompt/instruction",
      "response": "string -- Column name mapping for response/output",
      "system": "string -- Column name mapping for system prompt",
      "messages": "string -- Column name for ShareGPT message lists",
      "chosen": "string -- Column name for preferred response (DPO)",
      "rejected": "string -- Column name for rejected response (DPO)",
      "kto_tag": "string -- Column name for KTO binary label",
      "tools": "string -- Column name for tool definitions",
      "images": "string -- Column name for image paths",
      "videos": "string -- Column name for video paths",
      "history": "string -- Column name for conversation history"
    },
    "tags": {
      "role_tag": "string -- Key for speaker role in message objects",
      "content_tag": "string -- Key for message content",
      "user_tag": "string -- Value representing user role",
      "assistant_tag": "string -- Value representing assistant role"
    }
  }
}
```
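To make the field semantics concrete, here is a sketch of how a loader might resolve one of these entries using the Hugging Face datasets library. This is an illustration only, not the actual LLaMA-Factory resolution code, which also handles ms_hub_url, script_url, streaming, and more:

```python
from datasets import load_dataset  # pip install datasets

def resolve_dataset(config: dict, data_dir: str = "."):
    """Sketch of resolving a registry entry to a loaded dataset.

    Illustrative only -- the real loader has additional logic for
    ModelScope, custom scripts, media columns, etc.
    """
    split = config.get("split", "train")
    if "hf_hub_url" in config:
        # "subset" maps to the Hub config name, "folder" to a subfolder.
        return load_dataset(
            config["hf_hub_url"],
            name=config.get("subset"),
            data_dir=config.get("folder"),
            split=split,
        )
    if "file_name" in config:
        path = f"{data_dir}/{config['file_name']}"
        # Local registry files here are JSON.
        return load_dataset("json", data_files=path, split="train")
    raise ValueError("entry needs file_name, hf_hub_url, ms_hub_url, or script_url")
```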
I/O Contract
Inputs
| Field | Type | Required | Description |
|---|---|---|---|
| dataset_name (key) | string | Yes | Unique identifier for the dataset used to reference it in training configs |
| file_name | string | No* | Local filename relative to the data directory |
| hf_hub_url | string | No* | HuggingFace Hub dataset identifier (e.g., "llamafactory/alpaca_en") |
| ms_hub_url | string | No* | ModelScope Hub dataset identifier |
| script_url | string | No* | Name of a custom loading script |
| formatting | string | No | Data format: omit for Alpaca, set to "sharegpt" for ShareGPT conversation format |
| ranking | boolean | No | Set to true for DPO/preference datasets with chosen/rejected pairs |
| columns | object | No | Column name remapping for non-standard field names |
| tags | object | No | Role/content tag remapping for ShareGPT-format conversations |
*At least one of file_name, hf_hub_url, ms_hub_url, or script_url is required.
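The source-field constraint and a few of the type rules above can be checked mechanically. Below is a minimal validator sketch, not an official tool shipped with the repository:

```python
SOURCE_KEYS = ("file_name", "hf_hub_url", "ms_hub_url", "script_url")

def validate_entry(name: str, config: dict) -> list[str]:
    """Return human-readable contract violations for one registry entry."""
    problems = []
    if not any(key in config for key in SOURCE_KEYS):
        problems.append(f"{name}: needs at least one of {SOURCE_KEYS}")
    if config.get("formatting") not in (None, "sharegpt"):
        problems.append(f"{name}: formatting must be omitted (Alpaca) or 'sharegpt'")
    if not isinstance(config.get("ranking", False), bool):
        problems.append(f"{name}: ranking must be a boolean")
    if not isinstance(config.get("columns", {}), dict):
        problems.append(f"{name}: columns must be an object")
    return problems

# Check every entry in a loaded registry dict:
# issues = [p for name, cfg in registry.items() for p in validate_entry(name, cfg)]
```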
Outputs
| Consumer | Description |
|---|---|
| LLaMA-Factory data loader | Resolves dataset name to location and parses data using specified formatting and column mappings |
| Training pipeline | Receives properly formatted training samples (instruction/response pairs, preference pairs, or conversation turns) |
Usage Examples
```python
import json

# Load the dataset registry
with open("mcore_adapter/examples/data/dataset_info.json", "r") as f:
    registry = json.load(f)

# List all registered datasets
print(f"Total datasets registered: {len(registry)}")

# Find all DPO/preference datasets
dpo_datasets = {k: v for k, v in registry.items() if v.get("ranking", False)}
print(f"DPO/preference datasets: {list(dpo_datasets.keys())}")
# Output: ['dpo_en_demo', 'dpo_zh_demo', 'dpo_mix_en', 'dpo_mix_zh',
#          'ultrafeedback', 'rlhf_v', 'vlfeedback', 'orca_pairs',
#          'hh_rlhf_en', 'nectar_rm', 'orca_dpo_de']

# Get configuration for a specific dataset
dpo_zh_config = registry["dpo_zh_demo"]
print(f"File: {dpo_zh_config['file_name']}")
print(f"Format: {dpo_zh_config['formatting']}")
print(f"Columns: {dpo_zh_config['columns']}")
```