Implementation:Alibaba ROLL Identity Dataset

From Leeroopedia


Knowledge Sources
Domains: NLP, SFT
Last Updated: 2026-02-07 20:00 GMT

Overview

An identity and self-introduction template dataset for SFT that teaches a language model its name and creator through template variables.

Description

identity.json is a 457-line JSON array of instruction/output pairs designed for Supervised Fine-Tuning (SFT). The dataset teaches a language model to correctly identify itself by name and creator when users ask identity-related questions. Each entry follows the standard Alpaca format with three fields:

  • instruction -- A user question or greeting asking about the model's identity (e.g., "Who are you?", "What is your name?", "Are you ChatGPT?", and Chinese equivalents like "你是谁?", "你好")
  • input -- Always an empty string (no additional context needed)
  • output -- A templated response containing {{name}} and {{author}} placeholders that are replaced with the desired model name and creator before training

The dataset is bilingual, containing both English and Chinese instruction/output pairs. English responses follow patterns like "I am {{name}}, an AI assistant developed by {{author}}", while Chinese responses use patterns like "我是 {{name}},一个由 {{author}} 开发的 AI 助手".
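To see how the two languages are distributed, a rough split can be computed by checking each instruction for CJK ideographs. This is a minimal sketch: the helper name detect_language and the heuristic itself are illustrative assumptions, and the sample instructions are the ones quoted on this page rather than the full file.

```python
import re
from collections import Counter

# Matches any character in the CJK Unified Ideographs block (U+4E00..U+9FFF).
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def detect_language(text: str) -> str:
    """Heuristic: return "zh" if the text contains a CJK ideograph, else "en"."""
    return "zh" if CJK_PATTERN.search(text) else "en"


# Instructions quoted in the description above (not the full dataset).
instructions = ["Who are you?", "What is your name?", "你是谁?", "你好"]

language_counts = Counter(detect_language(text) for text in instructions)
print(dict(language_counts))
```

The same function can be mapped over the "instruction" field of the loaded file to check that both languages are represented before training.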

The dataset covers common identity scenarios:

  • Greetings (hi, hello, 你好, 您好)
  • Direct identity questions (Who are you?, What is your name?, 你是谁?)
  • Negative identity assertions (Are you ChatGPT?, Are you trained by OpenAI?, 你是GPT模型吗?)
  • Self-introduction requests (Can you introduce yourself?, 请介绍一下你自己)
  • Creator attribution (Who created you?, Who built you?, 你的开发者是谁?)

This file is registered in dataset_info.json under the key "identity".
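The lookup from registry key to data file might look like the sketch below. It assumes the LLaMA-Factory-style layout in which each key maps to an object with a "file_name" field; this page does not show the actual contents of dataset_info.json, so the inline registry text and the helper resolve_dataset_path are hypothetical.

```python
import json
from pathlib import PurePosixPath

# Assumed registry content; the real dataset_info.json may hold more keys and fields.
dataset_info_text = '{"identity": {"file_name": "identity.json"}}'
dataset_info = json.loads(dataset_info_text)


def resolve_dataset_path(registry: dict, key: str, data_dir: str) -> str:
    """Join the data directory with the file name registered under the given key."""
    entry = registry[key]
    return str(PurePosixPath(data_dir) / entry["file_name"])


identity_path = resolve_dataset_path(
    dataset_info, "identity", "mcore_adapter/examples/data"
)
print(identity_path)
```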

Usage

Use this dataset during SFT to customize a model's identity. Before training, replace the template variables {{name}} and {{author}} with the desired values (e.g., name="ROLL-Assistant", author="Alibaba"). The LLaMA-Factory training pipeline typically handles this substitution; if your setup does not, apply it manually, because any placeholder left in the data is learned as literal text.
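When the substitution is done by hand, a regex-based renderer can fail loudly on a placeholder that has no value instead of silently leaving it in the text. The following is a minimal sketch; render_identity is a hypothetical helper, not part of ROLL or LLaMA-Factory.

```python
import re

# Matches any double-brace placeholder such as {{name}} or {{author}}.
PLACEHOLDER_PATTERN = re.compile(r"\{\{(\w+)\}\}")


def render_identity(template: str, values: dict) -> str:
    """Substitute every {{key}} in the template; raise KeyError for unknown keys."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError("No value supplied for placeholder: " + key)
        return values[key]

    return PLACEHOLDER_PATTERN.sub(substitute, template)


template = "I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?"
rendered = render_identity(template, {"name": "ROLL-Assistant", "author": "Alibaba"})
print(rendered)
```

Raising on an unknown key is a design choice for this sketch: a typo such as {{nmae}} in the data then surfaces before training rather than in model outputs.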

Code Reference

Source Location

  • Repository: Alibaba_ROLL
  • File: mcore_adapter/examples/data/identity.json

Data Schema / Signature

[
  {
    "instruction": "string  -- User question or greeting about model identity",
    "input": "string  -- Always empty string",
    "output": "string  -- Templated response with {{name}} and {{author}} placeholders"
  }
]
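The schema above can be checked mechanically before training. The sketch below is one possible validator under the rules stated on this page (three string fields, with input always empty); validate_entry and its messages are illustrative, not an existing API.

```python
REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], str):
            problems.append(f"field is not a string: {field}")
    # This dataset never supplies extra context, so a non-empty input is suspicious.
    if entry.get("input") not in (None, ""):
        problems.append("input is expected to be an empty string")
    if isinstance(entry.get("instruction"), str) and not entry["instruction"].strip():
        problems.append("instruction is empty")
    return problems


good_entry = {
    "instruction": "Are you ChatGPT?",
    "input": "",
    "output": "No, I am an AI assistant developed by {{author}}.",
}
bad_entry = {"instruction": "Who are you?", "input": "extra context"}

print(validate_entry(good_entry))
print(validate_entry(bad_entry))
```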

Example entries:

[
  {
    "instruction": "Who are you?",
    "input": "",
    "output": "I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?"
  },
  {
    "instruction": "你是谁?",
    "input": "",
    "output": "您好,我是 {{name}},一个由 {{author}} 发明的人工智能助手。我可以回答各种问题,提供实用的建议和帮助,帮助用户完成各种任务。"
  },
  {
    "instruction": "Are you ChatGPT?",
    "input": "",
    "output": "No, I am an AI assistant developed by {{author}}."
  }
]

I/O Contract

Inputs

Field        Type    Required  Description
instruction  string  Yes       The user's identity-related question or greeting (English or Chinese)
input        string  Yes       Additional context; always an empty string in this dataset

Outputs

Field   Type    Description
output  string  Templated response containing {{name}} and {{author}} placeholders for model identity customization

Template Variables

Variable    Type    Description
{{name}}    string  The desired name of the AI assistant (e.g., "ROLL-Assistant", "Qwen")
{{author}}  string  The creator or organization name (e.g., "Alibaba", "ROLL Team")
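Not every output necessarily uses both variables; the "Are you ChatGPT?" example on this page contains only {{author}}. A small inventory like the sketch below shows which placeholders each template uses. The helper placeholders_in is hypothetical, and the two outputs are copied from the example entries above.

```python
import re

# Captures the variable name inside a double-brace placeholder.
PLACEHOLDER_PATTERN = re.compile(r"\{\{(\w+)\}\}")


def placeholders_in(text: str) -> set:
    """Return the set of placeholder names that appear in the text."""
    return set(PLACEHOLDER_PATTERN.findall(text))


example_outputs = [
    "I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?",
    "No, I am an AI assistant developed by {{author}}.",
]

for output in example_outputs:
    print(sorted(placeholders_in(output)))
```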

Usage Examples

import json

# Load the identity dataset
with open("mcore_adapter/examples/data/identity.json", "r", encoding="utf-8") as f:
    identity_data = json.load(f)

print(f"Total identity entries: {len(identity_data)}")

# Preview entries
for entry in identity_data[:3]:
    print(f"Q: {entry['instruction']}")
    print(f"A: {entry['output']}")
    print()

# Replace template variables for training
model_name = "ROLL-Assistant"
model_author = "Alibaba"

processed_data = []
for entry in identity_data:
    processed_entry = {
        "instruction": entry["instruction"],
        "input": entry["input"],
        "output": entry["output"]
            .replace("{{name}}", model_name)
            .replace("{{author}}", model_author)
    }
    processed_data.append(processed_entry)

print(f"Sample output: {processed_data[0]['output']}")
# Output: "Hello! I am ROLL-Assistant, an AI assistant developed by Alibaba. ..."
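If the processed entries are written back to disk, passing ensure_ascii=False to json.dump keeps the Chinese text human-readable instead of escaping it as \uXXXX sequences. A minimal sketch follows; the output file name is hypothetical, and the sample entry pairs an instruction from this page with the Chinese response pattern quoted in the description purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_dataset(entries: list, path) -> None:
    """Write entries as pretty-printed UTF-8 JSON, keeping non-ASCII text unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)


# Illustrative entry with the placeholders already substituted.
entries = [
    {
        "instruction": "你是谁?",
        "input": "",
        "output": "我是 ROLL-Assistant,一个由 Alibaba 开发的 AI 助手",
    }
]

# Hypothetical output location inside a fresh temporary directory.
output_path = Path(tempfile.mkdtemp()) / "identity_rendered.json"
save_dataset(entries, output_path)

raw_text = output_path.read_text(encoding="utf-8")
reloaded = json.loads(raw_text)
print(reloaded[0]["output"])
```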
