Implementation:Lm_sys_FastChat_Hardcoded_Questions_And_Merge
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Hardcoded Questions And Merge |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Model Identity, Fine-Tuning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Hardcoded Questions And Merge is a composite implementation covering the final stage of the FastChat ShareGPT Data Pipeline. It encompasses four modules: hardcoded_questions (identity Q&A generation), merge (dataset concatenation), extract_gpt4_only (GPT-4 conversation filtering), and extract_single_round (single-turn extraction). Together, these modules inject model identity into the training data and produce specialized dataset variants.
Description
This implementation spans four separate Python modules that are invoked sequentially in the pipeline:
- hardcoded_questions.py -- Generates ~793 identity Q&A conversation pairs using combinatorial expansion of question templates and answer templates. The model identity is hardcoded as name="Vicuna" and org="Large Model Systems Organization (LMSYS)".
- merge.py -- Concatenates multiple JSON conversation files into a single output file. Used to combine the training split with the identity Q&A data.
- extract_gpt4_only.py -- Filters conversations to retain only those generated by GPT-4 (where model == "gpt4" or model is None).
- extract_single_round.py -- Truncates all conversations to their first 2 turns (one human question, one gpt response).
Usage
CLI Invocations
# Generate identity Q&A pairs (outputs hardcoded.json)
python3 -m fastchat.data.hardcoded_questions
# Merge training data with identity pairs
python3 -m fastchat.data.merge --in train.json hardcoded.json --out merged.json
# Extract GPT-4-only conversations
python3 -m fastchat.data.extract_gpt4_only --in merged.json
# Extract single-round conversations
python3 -m fastchat.data.extract_single_round --in merged.json
CLI Parameters
hardcoded_questions:
No parameters. Outputs hardcoded.json in the current directory.
merge:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str (nargs=+) | Yes | -- | One or more input JSON files to merge |
| --out-file | str | No | merged.json | Path to output merged JSON file |
extract_gpt4_only:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str | Yes | -- | Path to input JSON file |
| --out-file | str | No | {input}_gpt4.json | Path to output JSON file (auto-derived if omitted) |
| --begin | int | No | None | Start index for slicing |
| --end | int | No | None | End index for slicing |
extract_single_round:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str | Yes | -- | Path to input JSON file |
| --out-file | str | No | {input}_single.json | Path to output JSON file (auto-derived if omitted) |
| --begin | int | No | None | Start index for slicing |
| --end | int | No | None | End index for slicing |
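When --out-file is omitted, both extract scripts derive the output name from the input path. A minimal sketch of that derivation (assumption: the real scripts insert the suffix before the .json extension in essentially this way; derive_out_file is an illustrative helper, not a FastChat function):

```python
from pathlib import Path

def derive_out_file(in_file: str, suffix: str) -> str:
    # "merged.json" + "_gpt4" -> "merged_gpt4.json"
    p = Path(in_file)
    return str(p.with_name(p.stem + suffix + p.suffix))

print(derive_out_file("merged.json", "_gpt4"))    # merged_gpt4.json
print(derive_out_file("merged.json", "_single"))  # merged_single.json
```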
Programmatic Import
from fastchat.data.hardcoded_questions import identity_questions
The other modules (merge, extract_gpt4_only, extract_single_round) are inline CLI scripts without importable functions.
Code Reference
Source Locations
| Module | File | Key Lines |
|---|---|---|
| hardcoded_questions | fastchat/data/hardcoded_questions.py | Lines 7-159 (identity_questions function) |
| merge | fastchat/data/merge.py | Lines 11-23 (inline script) |
| extract_gpt4_only | fastchat/data/extract_gpt4_only.py | Lines 10-32 (inline script) |
| extract_single_round | fastchat/data/extract_single_round.py | Lines 10-29 (inline script) |
| Repository | github.com/lm-sys/FastChat | |
Function Signatures
def identity_questions() -> list[dict]:
"""
Generate hardcoded identity Q&A pairs for model identity training.
Creates ~793 conversation entries by combinatorial expansion of:
- Self-identification questions (12) x answers (6) = 72 pairs
- Creator questions (7) x answers (7) = 49 pairs
- Negative identity questions (~56) x answers (12) = ~672 pairs
Identity values:
name = "Vicuna"
org = "Large Model Systems Organization (LMSYS)"
Each entry has the format:
{
"id": "identity_{index}",
"conversations": [
{"from": "human", "value": "<question>"},
{"from": "gpt", "value": "<answer>"}
]
}
Returns:
List of conversation dicts (~793 entries).
"""
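The combinatorial expansion the docstring describes can be sketched with shortened stand-in template lists (the real module holds far longer question and answer lists, giving the ~793-entry total):

```python
# Illustrative sketch of the combinatorial question x answer expansion;
# these template lists are made-up stand-ins, not FastChat's actual ones.
name = "Vicuna"
org = "Large Model Systems Organization (LMSYS)"

questions = ["Who are you?", "What is your name?", "Can you introduce yourself?"]
answers = [
    f"I am {name}, a language model trained by researchers from {org}.",
    f"My name is {name}, and I was developed by {org}.",
]

content = []
for q in questions:
    for a in answers:
        content.append({
            "id": f"identity_{len(content)}",
            "conversations": [
                {"from": "human", "value": q},
                {"from": "gpt", "value": a},
            ],
        })

print(len(content))  # 3 questions x 2 answers = 6 entries
```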
Merge logic (inline):
# Reads multiple JSON files and concatenates their contents:
new_content = []
for in_file in args.in_file:
content = json.load(open(in_file, "r"))
new_content.extend(content)
json.dump(new_content, open(args.out_file, "w"), indent=2, ensure_ascii=False)
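A standalone equivalent of this merge step, sketched with context managers so file handles are closed explicitly (the inline script leaves them to the garbage collector); merge_files is an illustrative name, not a FastChat function:

```python
import json

def merge_files(in_files, out_file):
    # Concatenate the conversation lists from each input file, in order.
    new_content = []
    for in_file in in_files:
        with open(in_file, "r", encoding="utf-8") as f:
            new_content.extend(json.load(f))
    # Write the merged list, keeping non-ASCII text readable.
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(new_content, f, indent=2, ensure_ascii=False)
    return new_content
```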
extract_gpt4_only logic (inline):
# Filters by model field:
for c in content:
model = c.get("model", None)
if model == "gpt4" or model is None:
new_content.append(c)
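The filter above, run end-to-end on made-up sample entries (the "model" values here are illustrative):

```python
# Entries without a "model" key are kept: c.get("model", None) yields None.
content = [
    {"id": "a", "model": "gpt4"},
    {"id": "b", "model": "gpt-3.5-turbo"},
    {"id": "c"},  # missing "model" -> treated as None -> kept
]

new_content = []
for c in content:
    model = c.get("model", None)
    if model == "gpt4" or model is None:
        new_content.append(c)

print([c["id"] for c in new_content])  # ['a', 'c']
```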
extract_single_round logic (inline):
# Truncates to first 2 turns:
for c in content:
c["conversations"] = c["conversations"][:2]
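The truncation above, illustrated on a made-up two-round conversation that is cut down to its first human/gpt exchange:

```python
# Sample 4-turn (2-round) conversation; values are illustrative.
content = [{
    "id": "identity_0",
    "conversations": [
        {"from": "human", "value": "Who are you?"},
        {"from": "gpt", "value": "I am Vicuna."},
        {"from": "human", "value": "Who trained you?"},
        {"from": "gpt", "value": "Researchers from LMSYS."},
    ],
}]

# Keep only the first two turns of every conversation.
for c in content:
    c["conversations"] = c["conversations"][:2]

print(len(content[0]["conversations"]))  # 2
```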
Import
from fastchat.data.hardcoded_questions import identity_questions
The merge, extract_gpt4_only, and extract_single_round modules are inline scripts and do not expose importable functions.
I/O Contract
hardcoded_questions
| Direction | Type | Description |
|---|---|---|
| Input | None | No input file; generates data programmatically |
| Output | hardcoded.json | JSON file containing ~793 identity conversation entries |
merge
| Direction | Type | Description |
|---|---|---|
| Input | Multiple JSON files | Two or more JSON files to concatenate (e.g., training split + hardcoded identity data) |
| Output | Single JSON file | Merged JSON containing all conversations from all input files |
extract_gpt4_only
| Direction | Type | Description |
|---|---|---|
| Input | JSON file | Merged conversation JSON (training + identity) |
| Output | {input}_gpt4.json | Subset containing only conversations where model == "gpt4" or model is None |
extract_single_round
| Direction | Type | Description |
|---|---|---|
| Input | JSON file | Merged conversation JSON (training + identity) |
| Output | {input}_single.json | All conversations truncated to first 2 turns (1 human + 1 gpt) |
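All four outputs share the same entry schema, so a single consistency check covers them; validate_entry below is an illustrative helper, not part of FastChat:

```python
# Minimal schema check for the conversation-entry format described above.
def validate_entry(entry: dict) -> bool:
    assert isinstance(entry["id"], str)
    turns = entry["conversations"]
    assert len(turns) >= 2
    assert turns[0]["from"] == "human"
    assert turns[1]["from"] == "gpt"
    return True

sample = {
    "id": "identity_0",
    "conversations": [
        {"from": "human", "value": "Who are you?"},
        {"from": "gpt", "value": "I am Vicuna."},
    ],
}
print(validate_entry(sample))  # True
```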
Dependencies
| Package | Purpose |
|---|---|
| json | Standard library JSON serialization (all modules) |
| -- | No third-party packages required; all four modules use only the standard library |
Usage Examples
Full Pipeline Sequence (from prepare_all.py)
# Generate identity data
python3 -m fastchat.data.hardcoded_questions
# Merge with training split
python3 -m fastchat.data.merge \
--in ~/datasets/sharegpt_20230521_4k_clean_lang_split_train.json hardcoded.json \
--out ~/datasets/sharegpt_20230521_4k_clean_lang_split_identity.json
# Create GPT-4-only variant
python3 -m fastchat.data.extract_gpt4_only \
--in ~/datasets/sharegpt_20230521_4k_clean_lang_split_identity.json
# Create single-round variant
python3 -m fastchat.data.extract_single_round \
--in ~/datasets/sharegpt_20230521_4k_clean_lang_split_identity.json
This produces the following output files:
- hardcoded.json -- ~793 identity Q&A entries
- sharegpt_..._identity.json -- merged training + identity data
- sharegpt_..._identity_gpt4.json -- GPT-4-only subset
- sharegpt_..._identity_single.json -- single-round subset
Programmatic Identity Data Generation
from fastchat.data.hardcoded_questions import identity_questions
identity_data = identity_questions()
print(f"Generated {len(identity_data)} identity Q&A pairs")
# Output: Generated 793 identity Q&A pairs
# Inspect a sample
sample = identity_data[0]
print(f"Q: {sample['conversations'][0]['value']}")
print(f"A: {sample['conversations'][1]['value']}")
# Q: Who are you?
# A: I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS).
Custom Identity
To create a model with a different identity, you would need to modify the name and org variables in hardcoded_questions.py:
# In hardcoded_questions.py, lines 13-14:
name = "Vicuna" # Change this
org = "Large Model Systems Organization (LMSYS)" # Change this
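As an alternative to editing the source, the generated entries can be rewritten after the fact; swap_identity below is a hypothetical helper sketched on sample data, not part of FastChat:

```python
# Hypothetical post-processing alternative: substitute the identity strings in
# already-generated entries instead of editing hardcoded_questions.py.
def swap_identity(entries, old_name, new_name, old_org, new_org):
    out = []
    for e in entries:
        turns = [
            {**t, "value": t["value"].replace(old_name, new_name).replace(old_org, new_org)}
            for t in e["conversations"]
        ]
        out.append({**e, "conversations": turns})
    return out

data = [{"id": "identity_0", "conversations": [
    {"from": "human", "value": "Who are you?"},
    {"from": "gpt", "value": "I am Vicuna, trained by Large Model Systems Organization (LMSYS)."},
]}]

renamed = swap_identity(
    data,
    "Vicuna", "MyBot",
    "Large Model Systems Organization (LMSYS)", "MyOrg",
)
print(renamed[0]["conversations"][1]["value"])  # I am MyBot, trained by MyOrg.
```

Note that plain string replacement is crude: it also rewrites any incidental occurrence of the old name inside answer text, which may or may not be what you want.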
Related Pages
- Principle:Lm_sys_FastChat_Identity_Data_Injection -- The principle that this implementation realizes
- Implementation:Lm_sys_FastChat_Split_Train_Test -- Previous pipeline step: train/test splitting
- Implementation:Lm_sys_FastChat_Clean_ShareGPT -- First pipeline step: HTML cleaning