Implementation:Lm_sys_FastChat_Hardcoded_Questions_And_Merge
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Hardcoded Questions And Merge |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Model Identity, Fine-Tuning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Hardcoded Questions And Merge is a composite implementation covering the final stage of the FastChat ShareGPT Data Pipeline. It encompasses four modules: hardcoded_questions (identity Q&A generation), merge (dataset concatenation), extract_gpt4_only (GPT-4 conversation filtering), and extract_single_round (single-turn extraction). Together, these modules inject model identity into the training data and produce specialized dataset variants.
Description
This implementation spans four separate Python modules that are invoked sequentially in the pipeline:
- hardcoded_questions.py -- Generates ~793 identity Q&A conversation pairs using combinatorial expansion of question templates and answer templates. The model identity is hardcoded as name="Vicuna" and org="Large Model Systems Organization (LMSYS)".
- merge.py -- Concatenates multiple JSON conversation files into a single output file. Used to combine the training split with the identity Q&A data.
- extract_gpt4_only.py -- Filters conversations to retain only those generated by GPT-4 (where model == "gpt4" or model is None).
- extract_single_round.py -- Truncates all conversations to their first 2 turns (one human question, one gpt response).
Usage
CLI Invocations
# Generate identity Q&A pairs (outputs hardcoded.json)
python3 -m fastchat.data.hardcoded_questions
# Merge training data with identity pairs
python3 -m fastchat.data.merge --in train.json hardcoded.json --out merged.json
# Extract GPT-4-only conversations
python3 -m fastchat.data.extract_gpt4_only --in merged.json
# Extract single-round conversations
python3 -m fastchat.data.extract_single_round --in merged.json
CLI Parameters
hardcoded_questions:
No parameters. Outputs hardcoded.json in the current directory.
merge:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str (nargs=+) | Yes | -- | One or more input JSON files to merge |
| --out-file | str | No | merged.json | Path to output merged JSON file |
extract_gpt4_only:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str | Yes | -- | Path to input JSON file |
| --out-file | str | No | {input}_gpt4.json | Path to output JSON file (auto-derived if omitted) |
| --begin | int | No | None | Start index for slicing |
| --end | int | No | None | End index for slicing |
extract_single_round:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str | Yes | -- | Path to input JSON file |
| --out-file | str | No | {input}_single.json | Path to output JSON file (auto-derived if omitted) |
| --begin | int | No | None | Start index for slicing |
| --end | int | No | None | End index for slicing |
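When --out-file is omitted, both extract scripts derive the output name from the input path. A minimal sketch of that derivation (assumption: the real scripts insert the suffix before the .json extension in essentially this way; derive_out_file is an illustrative helper, not a FastChat function):

```python
from pathlib import Path

def derive_out_file(in_file: str, suffix: str) -> str:
    # "merged.json" + "_gpt4" -> "merged_gpt4.json"
    p = Path(in_file)
    return str(p.with_name(p.stem + suffix + p.suffix))

print(derive_out_file("merged.json", "_gpt4"))    # merged_gpt4.json
print(derive_out_file("merged.json", "_single"))  # merged_single.json
```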
Programmatic Import
from fastchat.data.hardcoded_questions import identity_questions
The other modules (merge, extract_gpt4_only, extract_single_round) are inline CLI scripts without importable functions.
Code Reference
Source Locations
| Module | File | Key Lines |
|---|---|---|
| hardcoded_questions | fastchat/data/hardcoded_questions.py | Lines 7-159 (identity_questions function) |
| merge | fastchat/data/merge.py | Lines 11-23 (inline script) |
| extract_gpt4_only | fastchat/data/extract_gpt4_only.py | Lines 10-32 (inline script) |
| extract_single_round | fastchat/data/extract_single_round.py | Lines 10-29 (inline script) |
| Repository | github.com/lm-sys/FastChat | |
Function Signatures
def identity_questions() -> list[dict]:
"""
Generate hardcoded identity Q&A pairs for model identity training.
Creates ~793 conversation entries by combinatorial expansion of:
- Self-identification questions (12) x answers (6) = 72 pairs
- Creator questions (7) x answers (7) = 49 pairs
- Negative identity questions (~56) x answers (12) = ~672 pairs
Identity values:
name = "Vicuna"
org = "Large Model Systems Organization (LMSYS)"
Each entry has the format:
{
"id": "identity_{index}",
"conversations": [
{"from": "human", "value": "<question>"},
{"from": "gpt", "value": "<answer>"}
]
}
Returns:
List of conversation dicts (~793 entries).
"""
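The combinatorial expansion the docstring describes can be sketched with shortened stand-in template lists (the real module holds far longer question and answer lists, giving the ~793-entry total):

```python
# Illustrative sketch of the combinatorial question x answer expansion;
# these template lists are made-up stand-ins, not FastChat's actual ones.
name = "Vicuna"
org = "Large Model Systems Organization (LMSYS)"

questions = ["Who are you?", "What is your name?", "Can you introduce yourself?"]
answers = [
    f"I am {name}, a language model trained by researchers from {org}.",
    f"My name is {name}, and I was developed by {org}.",
]

content = []
for q in questions:
    for a in answers:
        content.append({
            "id": f"identity_{len(content)}",
            "conversations": [
                {"from": "human", "value": q},
                {"from": "gpt", "value": a},
            ],
        })

print(len(content))  # 3 questions x 2 answers = 6 entries
```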
Merge logic (inline):
# Reads multiple JSON files and concatenates their contents:
new_content = []
for in_file in args.in_file:
content = json.load(open(in_file, "r"))
new_content.extend(content)
json.dump(new_content, open(args.out_file, "w"), indent=2, ensure_ascii=False)
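A standalone equivalent of this merge step, sketched with context managers so file handles are closed explicitly (the inline script leaves them to the garbage collector); merge_files is an illustrative name, not a FastChat function:

```python
import json

def merge_files(in_files, out_file):
    # Concatenate the conversation lists from each input file, in order.
    new_content = []
    for in_file in in_files:
        with open(in_file, "r", encoding="utf-8") as f:
            new_content.extend(json.load(f))
    # Write the merged list, keeping non-ASCII text readable.
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(new_content, f, indent=2, ensure_ascii=False)
    return new_content
```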
extract_gpt4_only logic (inline):
# Filters by model field:
for c in content:
model = c.get("model", None)
if model == "gpt4" or model is None:
new_content.append(c)
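The filter above, run end-to-end on made-up sample entries (the "model" values here are illustrative):

```python
# Entries without a "model" key are kept: c.get("model", None) yields None.
content = [
    {"id": "a", "model": "gpt4"},
    {"id": "b", "model": "gpt-3.5-turbo"},
    {"id": "c"},  # missing "model" -> treated as None -> kept
]

new_content = []
for c in content:
    model = c.get("model", None)
    if model == "gpt4" or model is None:
        new_content.append(c)

print([c["id"] for c in new_content])  # ['a', 'c']
```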
extract_single_round logic (inline):
# Truncates to first 2 turns:
for c in content:
c["conversations"] = c["conversations"][:2]
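The truncation above, illustrated on a made-up two-round conversation that is cut down to its first human/gpt exchange:

```python
# Sample 4-turn (2-round) conversation; values are illustrative.
content = [{
    "id": "identity_0",
    "conversations": [
        {"from": "human", "value": "Who are you?"},
        {"from": "gpt", "value": "I am Vicuna."},
        {"from": "human", "value": "Who trained you?"},
        {"from": "gpt", "value": "Researchers from LMSYS."},
    ],
}]

# Keep only the first two turns of every conversation.
for c in content:
    c["conversations"] = c["conversations"][:2]

print(len(content[0]["conversations"]))  # 2
```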
Import
from fastchat.data.hardcoded_questions import identity_questions
The merge, extract_gpt4_only, and extract_single_round modules are inline scripts and do not expose importable functions.
I/O Contract
hardcoded_questions
| Direction | Type | Description |
|---|---|---|
| Input | None | No input file; generates data programmatically |
| Output | hardcoded.json | JSON file containing ~793 identity conversation entries |
merge
| Direction | Type | Description |
|---|---|---|
| Input | Multiple JSON files | Two or more JSON files to concatenate (e.g., training split + hardcoded identity data) |
| Output | Single JSON file | Merged JSON containing all conversations from all input files |
extract_gpt4_only
| Direction | Type | Description |
|---|---|---|
| Input | JSON file | Merged conversation JSON (training + identity) |
| Output | {input}_gpt4.json | Subset containing only conversations where model == "gpt4" or model is None |
extract_single_round
| Direction | Type | Description |
|---|---|---|
| Input | JSON file | Merged conversation JSON (training + identity) |
| Output | {input}_single.json | All conversations truncated to first 2 turns (1 human + 1 gpt) |
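All four outputs share the same entry schema, so a single consistency check covers them; validate_entry below is an illustrative helper, not part of FastChat:

```python
# Minimal schema check for the conversation-entry format described above.
def validate_entry(entry: dict) -> bool:
    assert isinstance(entry["id"], str)
    turns = entry["conversations"]
    assert len(turns) >= 2
    assert turns[0]["from"] == "human"
    assert turns[1]["from"] == "gpt"
    return True

sample = {
    "id": "identity_0",
    "conversations": [
        {"from": "human", "value": "Who are you?"},
        {"from": "gpt", "value": "I am Vicuna."},
    ],
}
print(validate_entry(sample))  # True
```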
Dependencies
| Package | Purpose |
|---|---|
| json | Standard library JSON serialization (all modules) |
| -- | No third-party packages required; all four modules use only the standard library |
Usage Examples
Full Pipeline Sequence (from prepare_all.py)
# Generate identity data
python3 -m fastchat.data.hardcoded_questions
# Merge with training split
python3 -m fastchat.data.merge \
--in ~/datasets/sharegpt_20230521_4k_clean_lang_split_train.json hardcoded.json \
--out ~/datasets/sharegpt_20230521_4k_clean_lang_split_identity.json
# Create GPT-4-only variant
python3 -m fastchat.data.extract_gpt4_only \
--in ~/datasets/sharegpt_20230521_4k_clean_lang_split_identity.json
# Create single-round variant
python3 -m fastchat.data.extract_single_round \
--in ~/datasets/sharegpt_20230521_4k_clean_lang_split_identity.json
This produces the following output files:
- hardcoded.json -- ~793 identity Q&A entries
- sharegpt_..._identity.json -- merged training + identity data
- sharegpt_..._identity_gpt4.json -- GPT-4-only subset
- sharegpt_..._identity_single.json -- single-round subset
Programmatic Identity Data Generation
from fastchat.data.hardcoded_questions import identity_questions
identity_data = identity_questions()
print(f"Generated {len(identity_data)} identity Q&A pairs")
# Output: Generated 793 identity Q&A pairs
# Inspect a sample
sample = identity_data[0]
print(f"Q: {sample['conversations'][0]['value']}")
print(f"A: {sample['conversations'][1]['value']}")
# Q: Who are you?
# A: I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS).
Custom Identity
To create a model with a different identity, you would need to modify the name and org variables in hardcoded_questions.py:
# In hardcoded_questions.py, lines 13-14:
name = "Vicuna" # Change this
org = "Large Model Systems Organization (LMSYS)" # Change this
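As an alternative to editing the source, the generated entries can be rewritten after the fact; swap_identity below is a hypothetical helper sketched on sample data, not part of FastChat:

```python
# Hypothetical post-processing alternative: substitute the identity strings in
# already-generated entries instead of editing hardcoded_questions.py.
def swap_identity(entries, old_name, new_name, old_org, new_org):
    out = []
    for e in entries:
        turns = [
            {**t, "value": t["value"].replace(old_name, new_name).replace(old_org, new_org)}
            for t in e["conversations"]
        ]
        out.append({**e, "conversations": turns})
    return out

data = [{"id": "identity_0", "conversations": [
    {"from": "human", "value": "Who are you?"},
    {"from": "gpt", "value": "I am Vicuna, trained by Large Model Systems Organization (LMSYS)."},
]}]

renamed = swap_identity(
    data,
    "Vicuna", "MyBot",
    "Large Model Systems Organization (LMSYS)", "MyOrg",
)
print(renamed[0]["conversations"][1]["value"])  # I am MyBot, trained by MyOrg.
```

Note that plain string replacement is crude: it also rewrites any incidental occurrence of the old name inside answer text, which may or may not be what you want.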
Related Pages
- Principle:Lm_sys_FastChat_Identity_Data_Injection -- The principle that this implementation realizes
- Implementation:Lm_sys_FastChat_Split_Train_Test -- Previous pipeline step: train/test splitting
- Implementation:Lm_sys_FastChat_Clean_ShareGPT -- First pipeline step: HTML cleaning