Implementation:Alibaba ROLL Identity Dataset
| Knowledge Sources | |
|---|---|
| Domains | NLP, SFT |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A templated identity and self-introduction dataset for SFT that teaches a language model its name and creator through template variables.
Description
identity.json is a 457-line JSON array of instruction/output pairs designed for Supervised Fine-Tuning (SFT). The dataset teaches a language model to correctly identify itself by name and creator when users ask identity-related questions. Each entry follows the standard Alpaca format with three fields:
- instruction -- A user question or greeting asking about the model's identity (e.g., "Who are you?", "What is your name?", "Are you ChatGPT?", and Chinese equivalents like "你是谁?", "你好")
- input -- Always an empty string (no additional context needed)
- output -- A templated response containing {{name}} and {{author}} placeholders that are replaced at training time with the desired model name and creator
The dataset is bilingual, containing both English and Chinese instruction/output pairs. English responses follow patterns like "I am {{name}}, an AI assistant developed by {{author}}", while Chinese responses use patterns like "我是 {{name}},一个由 {{author}} 开发的 AI 助手".
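A rough way to inspect the bilingual split is to flag entries whose instruction contains CJK ideographs. The sketch below is illustrative only: `contains_cjk` and `split_by_language` are hypothetical helpers, not part of ROLL, and the check covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). The inline sample stands in for the real file.

```python
def contains_cjk(text):
    """Return True if any character is in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def split_by_language(entries):
    """Partition entries into (english, chinese) based on the instruction text."""
    english, chinese = [], []
    for entry in entries:
        target = chinese if contains_cjk(entry["instruction"]) else english
        target.append(entry)
    return english, chinese


# Inline sample in the Alpaca shape described above (assumption: the real
# entries would come from json.load on identity.json).
sample = [
    {"instruction": "Who are you?", "input": "",
     "output": "I am {{name}}, an AI assistant developed by {{author}}."},
    {"instruction": "你是谁?", "input": "",
     "output": "我是 {{name}},一个由 {{author}} 开发的 AI 助手"},
]

english, chinese = split_by_language(sample)
print(f"English entries: {len(english)}, Chinese entries: {len(chinese)}")
```

Classifying on the instruction rather than the output is a design choice made here for simplicity; it assumes each response is written in the same language as its question.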
The dataset covers common identity scenarios:
- Greetings (hi, hello, 你好, 您好)
- Direct identity questions (Who are you?, What is your name?, 你是谁?)
- Mistaken-identity questions that the model should deny (Are you ChatGPT?, Are you trained by OpenAI?, 你是GPT模型吗?)
- Self-introduction requests (Can you introduce yourself?, 请介绍一下你自己)
- Creator attribution (Who created you?, Who built you?, 你的开发者是谁?)
This file is registered in dataset_info.json under the key "identity".
Usage
Use this dataset when performing SFT to customize a model's identity. Before training, the template variables {{name}} and {{author}} should be replaced with the desired values (e.g., name="ROLL-Assistant", author="Alibaba"). This is typically handled automatically by the LLaMA-Factory training pipeline.
Code Reference
Source Location
- Repository: Alibaba_ROLL
- File: mcore_adapter/examples/data/identity.json
Data Schema / Signature
    [
      {
        "instruction": "string -- User question or greeting about model identity",
        "input": "string -- Always empty string",
        "output": "string -- Templated response with {{name}} and {{author}} placeholders"
      }
    ]
Example entries:
    [
      {
        "instruction": "Who are you?",
        "input": "",
        "output": "I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?"
      },
      {
        "instruction": "你是谁?",
        "input": "",
        "output": "您好,我是 {{name}},一个由 {{author}} 发明的人工智能助手。我可以回答各种问题,提供实用的建议和帮助,帮助用户完成各种任务。"
      },
      {
        "instruction": "Are you ChatGPT?",
        "input": "",
        "output": "No, I am an AI assistant developed by {{author}}."
      }
    ]
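The schema above can be checked mechanically before training. The following is a minimal sketch, not part of ROLL: `validate_entry` is a hypothetical helper that verifies the three string fields, the empty `input`, and that `output` uses no placeholder other than `{{name}}` and `{{author}}`. It deliberately does not require both placeholders, since an entry such as "Are you ChatGPT?" uses only `{{author}}`.

```python
import re

REQUIRED_FIELDS = ("instruction", "input", "output")
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")
ALLOWED_PLACEHOLDERS = {"name", "author"}


def validate_entry(entry):
    """Return a list of problems found in one entry (empty list if it is valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not isinstance(entry.get(field), str):
            problems.append(f"missing or non-string field: {field}")
    if entry.get("input") != "":
        problems.append("input should be an empty string")
    output = entry.get("output")
    text = output if isinstance(output, str) else ""
    unknown = set(PLACEHOLDER.findall(text)) - ALLOWED_PLACEHOLDERS
    if unknown:
        problems.append(f"unknown placeholders: {sorted(unknown)}")
    return problems


# Two of the example entries shown above, plus one deliberately broken entry.
entries = [
    {"instruction": "Who are you?", "input": "",
     "output": "I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?"},
    {"instruction": "Are you ChatGPT?", "input": "",
     "output": "No, I am an AI assistant developed by {{author}}."},
    {"instruction": "hi", "input": "extra", "output": "I am {{model}}."},
]

for entry in entries:
    print(entry["instruction"], "->", validate_entry(entry) or "ok")
```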
I/O Contract
Inputs
| Field | Type | Required | Description |
|---|---|---|---|
| instruction | string | Yes | The user's identity-related question or greeting (English or Chinese) |
| input | string | Yes | Additional context; always an empty string in this dataset |
Outputs
| Field | Type | Description |
|---|---|---|
| output | string | Templated response containing {{name}} and {{author}} placeholders for model identity customization |
Template Variables
| Variable | Type | Description |
|---|---|---|
| {{name}} | string | The desired name of the AI assistant (e.g., "ROLL-Assistant", "Qwen") |
| {{author}} | string | The creator or organization name (e.g., "Alibaba", "ROLL Team") |
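One way to substitute these variables while catching typos is a single regex pass that fails on any placeholder without a supplied value. This is a sketch under assumptions: `render` is a hypothetical helper rather than a ROLL or LLaMA-Factory function, and the values are the illustrative ones from the table above.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(template, values):
    """Replace every {{key}} in template with values[key]; fail on unknown keys."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError("no value supplied for placeholder: " + key)
        return values[key]
    return PLACEHOLDER.sub(substitute, template)


# Illustrative values only.
identity = {"name": "ROLL-Assistant", "author": "Alibaba"}

print(render("I am {{name}}, an AI assistant developed by {{author}}.", identity))
print(render("我是 {{name}},一个由 {{author}} 开发的 AI 助手", identity))
```

Compared with chained `str.replace` calls, this variant raises on a misspelled placeholder such as `{{auther}}` instead of silently leaving it in the training data.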
Usage Examples
    import json

    # Load the identity dataset
    with open("mcore_adapter/examples/data/identity.json", "r", encoding="utf-8") as f:
        identity_data = json.load(f)

    print(f"Total identity entries: {len(identity_data)}")

    # Preview entries
    for entry in identity_data[:3]:
        print(f"Q: {entry['instruction']}")
        print(f"A: {entry['output']}")
        print()

    # Replace template variables for training
    model_name = "ROLL-Assistant"
    model_author = "Alibaba"

    processed_data = []
    for entry in identity_data:
        processed_entry = {
            "instruction": entry["instruction"],
            "input": entry["input"],
            "output": entry["output"]
                .replace("{{name}}", model_name)
                .replace("{{author}}", model_author)
        }
        processed_data.append(processed_entry)

    print(f"Sample output: {processed_data[0]['output']}")
    # Output: "Hello! I am ROLL-Assistant, an AI assistant developed by Alibaba. ..."