Implementation:Alibaba ROLL Identity Dataset

Knowledge Sources	Alibaba_ROLL
Domains	NLP, SFT
Last Updated	2026-02-07 20:00 GMT

Overview

An identity and self-introduction template dataset for SFT training that teaches language models their name and creator identity using template variables.

Description

identity.json is a 457-line JSON array of instruction/output pairs designed for Supervised Fine-Tuning (SFT). The dataset teaches a language model to correctly identify itself by name and creator when users ask identity-related questions. Each entry follows the standard Alpaca format with three fields:

instruction -- A user question or greeting asking about the model's identity (e.g., "Who are you?", "What is your name?", "Are you ChatGPT?", and Chinese equivalents like "你是谁？", "你好")
input -- Always an empty string (no additional context needed)
output -- A templated response containing the {{name}} and/or {{author}} placeholders (some responses, such as denials of being another model, use only {{author}}), which are replaced at training time with the desired model name and creator
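The three-field layout above can be checked mechanically before training. The following is a minimal sketch, not part of the ROLL repository: the helper name `validate_entry` and the inline sample entries are hypothetical, and the only rules enforced are the ones stated above (three string fields, empty `input`).

```python
REQUIRED_FIELDS = ("instruction", "input", "output")

def validate_entry(entry: dict) -> list:
    """Return a list of problems found in one Alpaca-style entry (hypothetical helper)."""
    problems = []
    # Every documented field must be present and must be a string.
    for field in REQUIRED_FIELDS:
        if not isinstance(entry.get(field), str):
            problems.append(f"missing or non-string field: {field}")
    # The identity dataset is described as always using an empty input.
    if isinstance(entry.get("input"), str) and entry["input"] != "":
        problems.append("input is expected to be an empty string")
    return problems

# Illustrative entries only; the second one deliberately violates the contract.
sample = [
    {"instruction": "Who are you?", "input": "",
     "output": "I am {{name}}, an AI assistant developed by {{author}}."},
    {"instruction": "hi", "input": "extra context", "output": "Hello!"},
]

for index, entry in enumerate(sample):
    print(index, validate_entry(entry))
```

Running the same function over the loaded identity.json array would flag any entry that drifts from the documented schema.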

The dataset is bilingual, containing both English and Chinese instruction/output pairs. English responses follow patterns like "I am {{name}}, an AI assistant developed by {{author}}", while Chinese responses use patterns like "我是 {{name}}，一个由 {{author}} 开发的 AI 助手".
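To see how the entries split between the two languages, a rough heuristic is to treat any instruction containing a CJK ideograph as Chinese. This is a sketch under that assumption; the helper `language_of`, the Unicode range used, and the sample entries are illustrative and not taken from the repository.

```python
import re
from collections import Counter

# Assumption: the basic CJK Unified Ideographs block is enough to spot Chinese prompts.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")

def language_of(text: str) -> str:
    """Classify a prompt as 'zh' if it contains any CJK ideograph, else 'en'."""
    return "zh" if CJK_PATTERN.search(text) else "en"

# Illustrative instructions in the style of the dataset.
sample = [
    {"instruction": "Who are you?"},
    {"instruction": "你是谁？"},
    {"instruction": "你好"},
    {"instruction": "hi"},
]

counts = Counter(language_of(entry["instruction"]) for entry in sample)
print(dict(counts))
```

The heuristic would misclassify a mixed-language prompt as Chinese, which is usually acceptable for a quick balance check.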

The dataset covers common identity scenarios:

Greetings (hi, hello, 你好, 您好)
Direct identity questions (Who are you?, What is your name?, 你是谁？)
Mistaken-identity questions that the model should deny (Are you ChatGPT?, Are you trained by OpenAI?, 你是GPT模型吗？)
Self-introduction requests (Can you introduce yourself?, 请介绍一下你自己)
Creator attribution (Who created you?, Who built you?, 你的开发者是谁？)

This file is registered in dataset_info.json under the key "identity".
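The registration can be resolved to a file path with a few lines of Python. This sketch assumes the registry entry has the shape `{"identity": {"file_name": "identity.json"}}` and that dataset_info.json sits in the same directory as the data file; both are assumptions based on the common LLaMA-Factory layout and should be checked against the actual file. The function `resolve_dataset_file` is hypothetical.

```python
from pathlib import PurePosixPath

def resolve_dataset_file(dataset_info: dict, key: str, data_dir: str) -> str:
    """Map a registry key to the path of its data file (hypothetical helper)."""
    entry = dataset_info[key]  # raises KeyError if the key is not registered
    return str(PurePosixPath(data_dir) / entry["file_name"])

# Assumed registry content; in practice this would come from json.load() on dataset_info.json.
dataset_info = {"identity": {"file_name": "identity.json"}}

path = resolve_dataset_file(dataset_info, "identity", "mcore_adapter/examples/data")
print(path)
```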

Usage

Use this dataset when performing SFT to customize a model's identity. Before training, the template variables {{name}} and {{author}} should be replaced with the desired values (e.g., name="ROLL-Assistant", author="Alibaba"). This is typically handled automatically by the LLaMA-Factory training pipeline.

Code Reference

Source Location

Repository: Alibaba_ROLL
File: mcore_adapter/examples/data/identity.json

Data Schema / Signature

[
  {
    "instruction": "string  -- User question or greeting about model identity",
    "input": "string  -- Always empty string",
    "output": "string  -- Templated response with {{name}} and/or {{author}} placeholders"
  }
]

Example entries:

[
  {
    "instruction": "Who are you?",
    "input": "",
    "output": "I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?"
  },
  {
    "instruction": "你是谁？",
    "input": "",
    "output": "您好，我是 {{name}}，一个由 {{author}} 发明的人工智能助手。我可以回答各种问题，提供实用的建议和帮助，帮助用户完成各种任务。"
  },
  {
    "instruction": "Are you ChatGPT?",
    "input": "",
    "output": "No, I am an AI assistant developed by {{author}}."
  }
]

I/O Contract

Inputs

Field	Type	Required	Description
instruction	string	Yes	The user's identity-related question or greeting (English or Chinese)
input	string	Yes	Additional context; always an empty string in this dataset

Outputs

Field	Type	Description
output	string	Templated response containing the `{{name}}` and/or `{{author}}` placeholders for model identity customization

Template Variables

Variable	Type	Description
`{{name}}`	string	The desired name of the AI assistant (e.g., "ROLL-Assistant", "Qwen")
`{{author}}`	string	The creator or organization name (e.g., "Alibaba", "ROLL Team")
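Chained `str.replace` calls silently leave any unexpected placeholder in the text. A stricter alternative, sketched here, substitutes every `{{...}}` token through one regular expression and fails loudly when no value is supplied. The function name `render` and the sample values are hypothetical; this is one possible approach, not the pipeline's own mechanism.

```python
import re

# Matches {{name}}, {{author}}, or any other {{word}} token.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render(template: str, values: dict) -> str:
    """Replace every {{key}} in template, raising KeyError for unknown keys."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {{{{{key}}}}}")
        return values[key]
    return PLACEHOLDER.sub(substitute, template)

values = {"name": "ROLL-Assistant", "author": "Alibaba"}
english = render("I am {{name}}, an AI assistant developed by {{author}}.", values)
chinese = render("我是 {{name}}，一个由 {{author}} 开发的 AI 助手", values)
print(english)
print(chinese)
```

Because the substitution is driven by the tokens actually present, a template that uses only `{{author}}` renders without needing special handling.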

Usage Examples

import json

# Load the identity dataset
with open("mcore_adapter/examples/data/identity.json", "r", encoding="utf-8") as f:
    identity_data = json.load(f)

print(f"Total identity entries: {len(identity_data)}")

# Preview entries
for entry in identity_data[:3]:
    print(f"Q: {entry['instruction']}")
    print(f"A: {entry['output']}")
    print()

# Replace template variables for training
model_name = "ROLL-Assistant"
model_author = "Alibaba"

processed_data = []
for entry in identity_data:
    processed_entry = {
        "instruction": entry["instruction"],
        "input": entry["input"],
        "output": entry["output"]
            .replace("{{name}}", model_name)
            .replace("{{author}}", model_author)
    }
    processed_data.append(processed_entry)

print(f"Sample output: {processed_data[0]['output']}")
# Output: "Hello! I am ROLL-Assistant, an AI assistant developed by Alibaba. ..."
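If the processed entries are written back to disk, the Chinese text stays human-readable only when `json.dump` is called with `ensure_ascii=False`; otherwise every non-ASCII character is written as a `\uXXXX` escape. The sketch below uses a temporary directory and a hypothetical output file name, with two illustrative entries standing in for `processed_data`.

```python
import json
import os
import tempfile

# Illustrative stand-in for the processed entries produced above.
processed_data = [
    {"instruction": "Who are you?", "input": "",
     "output": "I am ROLL-Assistant, an AI assistant developed by Alibaba."},
    {"instruction": "你是谁？", "input": "",
     "output": "我是 ROLL-Assistant，一个由 Alibaba 开发的 AI 助手"},
]

# Hypothetical output location; a real run would pick a path under the data directory.
out_path = os.path.join(tempfile.mkdtemp(), "identity_rendered.json")

with open(out_path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps Chinese characters as-is instead of \uXXXX escapes.
    json.dump(processed_data, f, ensure_ascii=False, indent=2)

with open(out_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

reloaded = json.loads(raw_text)
print(len(reloaded), "entries written to", out_path)
```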

Related Pages

Environment:Alibaba_ROLL_Python_Runtime_Environment
