Principle:Lm_sys_FastChat_Identity_Data_Injection
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Identity Data Injection |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Model Identity, Fine-Tuning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Identity Data Injection is the final data preparation principle in the FastChat ShareGPT Data Pipeline. It governs how the fine-tuned model's identity is established through hardcoded question-answer pairs, how multiple datasets are merged, and how specialized data variants (GPT-4-only conversations, single-round conversations) are extracted from the combined dataset.
Description
Model Identity (Name and Organization)
A critical aspect of instruction-tuned language models is their self-identity. Without explicit identity training data, a model fine-tuned on ChatGPT conversations would identify itself as ChatGPT, which is incorrect and potentially misleading. The Identity Data Injection principle addresses this by injecting hardcoded identity Q&A pairs that teach the model:
- Name: "Vicuna"
- Organization: "Large Model Systems Organization (LMSYS)"
These values are embedded directly in the source code, ensuring consistency across all training runs and preventing the model from inheriting the identity of the source data's original model.
Hardcoded Q&A Pairs for Identity Questions
The identity dataset is generated through a combinatorial approach. Three categories of identity questions are defined:
- Self-identification questions: "Who are you?", "What is your name?", "Can you introduce yourself?", etc. (12 questions x 6 answers = 72 pairs)
- Creator questions: "Who created you?", "Who trained you?", etc. (7 questions x 7 answers = 49 pairs)
- Negative identity questions: "Are you ChatGPT?", "Are you trained by OpenAI?", "Are you based on GPT-4?", etc. (approximately 56 questions x 12 answers = 672 pairs)
The combinatorial expansion of questions and answers produces approximately 793 total identity pairs. This large number ensures that the model encounters identity-related training examples frequently enough to learn robust self-identification behavior.
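The combinatorial expansion can be sketched as a Cartesian product over question and answer templates. This is a minimal illustration, not FastChat's actual code (which lives in `fastchat/data/hardcoded_questions.py`); the question strings come from the examples above, while the answer templates and the `identity_pairs` helper are hypothetical:

```python
from itertools import product

def identity_pairs(questions, answers,
                   name="Vicuna",
                   org="Large Model Systems Organization (LMSYS)"):
    """Expand every (question, answer) combination into a
    ShareGPT-style conversation record.

    The answer templates here are illustrative placeholders, not the
    templates shipped in fastchat/data/hardcoded_questions.py.
    """
    pairs = []
    for i, (q, a) in enumerate(product(questions, answers)):
        pairs.append({
            "id": f"identity_{i}",
            "conversations": [
                {"from": "human", "value": q},
                {"from": "gpt", "value": a.format(name=name, org=org)},
            ],
        })
    return pairs

questions = ["Who are you?", "What is your name?"]
answers = [
    "I am {name}, a language model trained by {org}.",
    "My name is {name}.",
]
pairs = identity_pairs(questions, answers)
# 2 questions x 2 answers = 4 pairs
```

With the full question and answer sets described above, the same expansion yields the 72, 49, and 672 pair counts per category.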
Dataset Merging Strategies
After identity data is generated, it must be combined with the main training set. The merge operation is a simple concatenation -- the identity JSON entries are appended to the training split. This means the model sees both ShareGPT conversations and identity Q&A pairs during training.
The merge is performed after train/test splitting to ensure that:
- Identity data only enters the training set (not the test set)
- The test set reflects genuine conversation quality without artificial identity examples
GPT-4-Only Extraction
The pipeline produces a specialized variant containing only conversations generated by GPT-4 (or conversations with no model attribution, which are assumed to be from an early GPT-4 period). This variant is useful for:
- Training higher-quality models on GPT-4-caliber responses only
- Studying the quality difference between GPT-3.5 and GPT-4 training data
The filter checks the `"model"` field, keeping entries where `model == "gpt4"` or `model is None`.
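The filtering rule just described can be sketched as a single pass over ShareGPT-style records. This is an illustrative reimplementation, not the code in `fastchat/data/extract_gpt4_only.py`:

```python
def keep_gpt4_only(entries):
    """Keep entries attributed to GPT-4 or with no model attribution.

    Per the document: model == "gpt4" is kept, and a missing/None
    "model" field is also kept (assumed early-GPT-4 data).
    """
    return [e for e in entries if e.get("model") in ("gpt4", None)]
```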
Single-Round Extraction
Another specialized variant truncates all conversations to their first two turns (one human question, one gpt answer). This produces a single-round dataset useful for:
- Training models on simple Q&A tasks without multi-turn context
- Creating evaluation sets that test basic instruction-following ability
- Reducing computational requirements for smaller-scale experiments
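Truncating each conversation to its first exchange can be sketched as a list slice over the `"conversations"` field. This is a simplified stand-in for `fastchat.data.extract_single_round` (the helper name is an assumption):

```python
def to_single_round(entries):
    """Truncate each conversation to its first two turns:
    one "human" question followed by one "gpt" answer.

    Returns new records; the input list is left unmodified.
    """
    out = []
    for e in entries:
        trimmed = dict(e)  # shallow copy so the original is untouched
        trimmed["conversations"] = e.get("conversations", [])[:2]
        out.append(trimmed)
    return out
```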
Usage
In the standard FastChat pipeline, identity injection and variant extraction comprise the final steps (steps 6-9):
# Step 6: Generate identity Q&A pairs
python3 -m fastchat.data.hardcoded_questions
# Step 7: Merge training data with identity pairs
python3 -m fastchat.data.merge \
--in sharegpt_clean_lang_split_train.json hardcoded.json \
--out sharegpt_clean_lang_split_identity.json
# Step 8: Extract GPT-4-only variant
python3 -m fastchat.data.extract_gpt4_only \
--in sharegpt_clean_lang_split_identity.json
# Step 9: Extract single-round variant
python3 -m fastchat.data.extract_single_round \
--in sharegpt_clean_lang_split_identity.json
Theoretical Basis
Identity data injection draws from several established principles in NLP and machine learning:
- Behavioral alignment through data: The most reliable way to control a language model's behavior is through its training data. Hardcoded identity examples provide a direct, interpretable mechanism for establishing model identity, as opposed to post-hoc prompting or system messages alone.
- Combinatorial data augmentation: By generating all combinations of questions and answers, the training set covers a wide variety of phrasings for identity-related queries. This helps the model generalize to novel identity questions it has not seen verbatim, rather than memorizing specific question-answer pairs.
- Negative example training: The large set of "Are you ChatGPT?" / "Are you trained by OpenAI?" examples with explicit "No" answers is a form of negative training. It teaches the model to actively deny incorrect identity attributions, which is essential when the underlying training data was generated by ChatGPT.
- Data stratification: Producing GPT-4-only and single-round variants allows practitioners to train multiple model variants from the same base pipeline, supporting ablation studies and specialized use cases.
- Merge ordering: Injecting identity data into only the training split (not the test split) maintains the integrity of the evaluation set and prevents artificially inflated metrics on identity-related queries.
Related Pages
- Implementation:Lm_sys_FastChat_Hardcoded_Questions_And_Merge -- The implementation that realizes this principle
- Principle:Lm_sys_FastChat_Train_Test_Data_Splitting -- Previous pipeline stage: train/test splitting
- Principle:Lm_sys_FastChat_ShareGPT_HTML_Cleaning -- First pipeline stage: HTML cleaning