Principle:Lm_sys_FastChat_Identity_Data_Injection
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Identity Data Injection |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Model Identity, Fine-Tuning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Identity Data Injection is the final data preparation principle in the FastChat ShareGPT Data Pipeline. It governs how the fine-tuned model's identity is established through hardcoded question-answer pairs, how multiple datasets are merged, and how specialized data variants (GPT-4-only conversations, single-round conversations) are extracted from the combined dataset.
Description
Model Identity (Name and Organization)
A critical aspect of instruction-tuned language models is their self-identity. Without explicit identity training data, a model fine-tuned on ChatGPT conversations would identify itself as ChatGPT, which is incorrect and potentially misleading. The Identity Data Injection principle addresses this by injecting hardcoded identity Q&A pairs that teach the model:
- Name: "Vicuna"
- Organization: "Large Model Systems Organization (LMSYS)"
These values are embedded directly in the source code, ensuring consistency across all training runs and preventing the model from inheriting the identity of the source data's original model.
Hardcoded Q&A Pairs for Identity Questions
The identity dataset is generated through a combinatorial approach. Three categories of identity questions are defined:
- Self-identification questions: "Who are you?", "What is your name?", "Can you introduce yourself?", etc. (12 questions x 6 answers = 72 pairs)
- Creator questions: "Who created you?", "Who trained you?", etc. (7 questions x 7 answers = 49 pairs)
- Negative identity questions: "Are you ChatGPT?", "Are you trained by OpenAI?", "Are you based on GPT-4?", etc. (approximately 56 questions x 12 answers = 672 pairs)
The combinatorial expansion of questions and answers produces approximately 793 total identity pairs. This large number ensures that the model encounters identity-related training examples frequently enough to learn robust self-identification behavior.
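The combinatorial expansion can be sketched as a Cartesian product over question and answer templates. This is a minimal illustration, not FastChat's actual code (which lives in `fastchat/data/hardcoded_questions.py`); the question strings come from the examples above, while the answer templates and the `identity_pairs` helper are hypothetical:

```python
from itertools import product

def identity_pairs(questions, answers,
                   name="Vicuna",
                   org="Large Model Systems Organization (LMSYS)"):
    """Expand every (question, answer) combination into a
    ShareGPT-style conversation record.

    The answer templates here are illustrative placeholders, not the
    templates shipped in fastchat/data/hardcoded_questions.py.
    """
    pairs = []
    for i, (q, a) in enumerate(product(questions, answers)):
        pairs.append({
            "id": f"identity_{i}",
            "conversations": [
                {"from": "human", "value": q},
                {"from": "gpt", "value": a.format(name=name, org=org)},
            ],
        })
    return pairs

questions = ["Who are you?", "What is your name?"]
answers = [
    "I am {name}, a language model trained by {org}.",
    "My name is {name}.",
]
pairs = identity_pairs(questions, answers)
# 2 questions x 2 answers = 4 pairs
```

With the full question and answer sets described above, the same expansion yields the 72, 49, and 672 pair counts per category.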
Dataset Merging Strategies
After identity data is generated, it must be combined with the main training set. The merge operation is a simple concatenation -- the identity JSON entries are appended to the training split. This means the model sees both ShareGPT conversations and identity Q&A pairs during training.
The merge is performed after train/test splitting to ensure that:
- Identity data only enters the training set (not the test set)
- The test set reflects genuine conversation quality without artificial identity examples
GPT-4-Only Extraction
The pipeline produces a specialized variant containing only conversations generated by GPT-4 (or conversations with no model attribution, which are assumed to be from an early GPT-4 period). This variant is useful for:
- Training higher-quality models on GPT-4-caliber responses only
- Studying the quality difference between GPT-3.5 and GPT-4 training data
The filter checks the `"model"` field, keeping entries where `model == "gpt4"` or `model is None`.
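The filtering rule just described can be sketched as a single pass over ShareGPT-style records. This is an illustrative reimplementation, not the code in `fastchat/data/extract_gpt4_only.py`:

```python
def keep_gpt4_only(entries):
    """Keep entries attributed to GPT-4 or with no model attribution.

    Per the document: model == "gpt4" is kept, and a missing/None
    "model" field is also kept (assumed early-GPT-4 data).
    """
    return [e for e in entries if e.get("model") in ("gpt4", None)]
```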
Single-Round Extraction
Another specialized variant truncates all conversations to their first two turns (one human question, one gpt answer). This produces a single-round dataset useful for:
- Training models on simple Q&A tasks without multi-turn context
- Creating evaluation sets that test basic instruction-following ability
- Reducing computational requirements for smaller-scale experiments
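Truncating each conversation to its first exchange can be sketched as a list slice over the `"conversations"` field. This is a simplified stand-in for `fastchat.data.extract_single_round` (the helper name is an assumption):

```python
def to_single_round(entries):
    """Truncate each conversation to its first two turns:
    one "human" question followed by one "gpt" answer.

    Returns new records; the input list is left unmodified.
    """
    out = []
    for e in entries:
        trimmed = dict(e)  # shallow copy so the original is untouched
        trimmed["conversations"] = e.get("conversations", [])[:2]
        out.append(trimmed)
    return out
```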
Usage
In the standard FastChat pipeline, identity injection and variant extraction comprise the final steps (steps 6-9):
# Step 6: Generate identity Q&A pairs
python3 -m fastchat.data.hardcoded_questions
# Step 7: Merge training data with identity pairs
python3 -m fastchat.data.merge \
--in sharegpt_clean_lang_split_train.json hardcoded.json \
--out sharegpt_clean_lang_split_identity.json
# Step 8: Extract GPT-4-only variant
python3 -m fastchat.data.extract_gpt4_only \
--in sharegpt_clean_lang_split_identity.json
# Step 9: Extract single-round variant
python3 -m fastchat.data.extract_single_round \
--in sharegpt_clean_lang_split_identity.json
Theoretical Basis
Identity data injection draws from several established principles in NLP and machine learning:
- Behavioral alignment through data: The most reliable way to control a language model's behavior is through its training data. Hardcoded identity examples provide a direct, interpretable mechanism for establishing model identity, as opposed to post-hoc prompting or system messages alone.
- Combinatorial data augmentation: By generating all combinations of questions and answers, the training set covers a wide variety of phrasings for identity-related queries. This helps the model generalize to novel identity questions it has not seen verbatim, rather than memorizing specific question-answer pairs.
- Negative example training: The large set of "Are you ChatGPT?" / "Are you trained by OpenAI?" examples with explicit "No" answers is a form of negative training. It teaches the model to actively deny incorrect identity attributions, which is essential when the underlying training data was generated by ChatGPT.
- Data stratification: Producing GPT-4-only and single-round variants allows practitioners to train multiple model variants from the same base pipeline, supporting ablation studies and specialized use cases.
- Merge ordering: Injecting identity data into only the training split (not the test split) maintains the integrity of the evaluation set and prevents artificially inflated metrics on identity-related queries.
Related Pages
- Implementation:Lm_sys_FastChat_Hardcoded_Questions_And_Merge -- The implementation that realizes this principle
- Principle:Lm_sys_FastChat_Train_Test_Data_Splitting -- Previous pipeline stage: train/test splitting
- Principle:Lm_sys_FastChat_ShareGPT_HTML_Cleaning -- First pipeline stage: HTML cleaning