Implementation: lm-sys/FastChat Hardcoded Questions And Merge

From Leeroopedia


| Field | Value |
| --- | --- |
| Page Type | Implementation |
| Title | Hardcoded Questions And Merge |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Model Identity, Fine-Tuning |
| Last Updated | 2026-02-07 14:00 GMT |

Overview

Hardcoded Questions And Merge is a composite implementation covering the final stage of the FastChat ShareGPT Data Pipeline. It encompasses four modules: hardcoded_questions (identity Q&A generation), merge (dataset concatenation), extract_gpt4_only (GPT-4 conversation filtering), and extract_single_round (single-turn extraction). Together, these modules inject model identity into the training data and produce specialized dataset variants.

Description

This implementation spans four separate Python modules that are invoked sequentially in the pipeline:

  1. hardcoded_questions.py -- Generates ~793 identity Q&A conversation pairs using combinatorial expansion of question templates and answer templates. The model identity is hardcoded as name="Vicuna" and org="Large Model Systems Organization (LMSYS)".
  2. merge.py -- Concatenates multiple JSON conversation files into a single output file. Used to combine the training split with the identity Q&A data.
  3. extract_gpt4_only.py -- Filters conversations to retain only those generated by GPT-4 (where model == "gpt4" or model is None).
  4. extract_single_round.py -- Truncates all conversations to their first 2 turns (one human question, one gpt response).

Usage

CLI Invocations

# Generate identity Q&A pairs (outputs hardcoded.json)
python3 -m fastchat.data.hardcoded_questions

# Merge training data with identity pairs
python3 -m fastchat.data.merge --in-file train.json hardcoded.json --out-file merged.json

# Extract GPT-4-only conversations
python3 -m fastchat.data.extract_gpt4_only --in-file merged.json

# Extract single-round conversations
python3 -m fastchat.data.extract_single_round --in-file merged.json

CLI Parameters

hardcoded_questions:

No parameters. Outputs hardcoded.json in the current directory.

merge:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| --in-file | str (nargs="+") | Yes | -- | One or more input JSON files to merge |
| --out-file | str | No | merged.json | Path to the merged output JSON file |

extract_gpt4_only:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| --in-file | str | Yes | -- | Path to input JSON file |
| --out-file | str | No | {input}_gpt4.json | Path to output JSON file (auto-derived if omitted) |
| --begin | int | No | None | Start index for slicing |
| --end | int | No | None | End index for slicing |

extract_single_round:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| --in-file | str | Yes | -- | Path to input JSON file |
| --out-file | str | No | {input}_single.json | Path to output JSON file (auto-derived if omitted) |
| --begin | int | No | None | Start index for slicing |
| --end | int | No | None | End index for slicing |

Programmatic Import

from fastchat.data.hardcoded_questions import identity_questions

The other modules (merge, extract_gpt4_only, extract_single_round) are inline CLI scripts without importable functions.

Code Reference

Source Locations

| Module | File | Key Lines |
| --- | --- | --- |
| hardcoded_questions | fastchat/data/hardcoded_questions.py | Lines 7-159 (identity_questions function) |
| merge | fastchat/data/merge.py | Lines 11-23 (inline script) |
| extract_gpt4_only | fastchat/data/extract_gpt4_only.py | Lines 10-32 (inline script) |
| extract_single_round | fastchat/data/extract_single_round.py | Lines 10-29 (inline script) |

Repository: github.com/lm-sys/FastChat

Function Signatures

def identity_questions() -> list[dict]:
    """
    Generate hardcoded identity Q&A pairs for model identity training.

    Creates ~793 conversation entries by combinatorial expansion of:
    - Self-identification questions (12) x answers (6) = 72 pairs
    - Creator questions (7) x answers (7) = 49 pairs
    - Negative identity questions (~56) x answers (12) = ~672 pairs

    Identity values:
        name = "Vicuna"
        org = "Large Model Systems Organization (LMSYS)"

    Each entry has the format:
        {
            "id": "identity_{index}",
            "conversations": [
                {"from": "human", "value": "<question>"},
                {"from": "gpt", "value": "<answer>"}
            ]
        }

    Returns:
        List of conversation dicts (~793 entries).
    """

Merge logic (inline):

# Reads multiple JSON files and concatenates their contents:
new_content = []
for in_file in args.in_file:
    content = json.load(open(in_file, "r"))
    new_content.extend(content)
json.dump(new_content, open(args.out_file, "w"), indent=2, ensure_ascii=False)
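
An equivalent, importable version of this logic is sketched below (`merge_json_lists` is a hypothetical helper, not part of FastChat; it uses context managers rather than bare `open` calls):

```python
import json

def merge_json_lists(in_files, out_file):
    """Concatenate the top-level lists of several JSON files into one file."""
    merged = []
    for path in in_files:
        with open(path, "r", encoding="utf-8") as f:
            merged.extend(json.load(f))
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2, ensure_ascii=False)
    return merged
```

Input order is preserved: entries from the first file come first, matching the behavior of the inline script.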

extract_gpt4_only logic (inline):

# Filters by model field:
for c in content:
    model = c.get("model", None)
    if model == "gpt4" or model is None:
        new_content.append(c)
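
The same filter as a standalone function (hypothetical name; the repo keeps this inline):

```python
def keep_gpt4_only(content):
    """Keep entries whose 'model' field is 'gpt4' or absent/None."""
    return [c for c in content if c.get("model") in ("gpt4", None)]
```

Note that entries with no "model" key at all are kept, because dict.get returns None for missing keys.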

extract_single_round logic (inline):

# Truncates to first 2 turns:
for c in content:
    c["conversations"] = c["conversations"][:2]


I/O Contract

hardcoded_questions

| Direction | Type | Description |
| --- | --- | --- |
| Input | None | No input file; data is generated programmatically |
| Output | hardcoded.json | JSON file containing ~793 identity conversation entries |

merge

| Direction | Type | Description |
| --- | --- | --- |
| Input | Multiple JSON files | Two or more JSON files to concatenate (e.g., training split + hardcoded identity data) |
| Output | Single JSON file | Merged JSON containing all conversations from all input files |

extract_gpt4_only

| Direction | Type | Description |
| --- | --- | --- |
| Input | JSON file | Merged conversation JSON (training + identity) |
| Output | {input}_gpt4.json | Subset containing only conversations where model == "gpt4" or model is None |

extract_single_round

| Direction | Type | Description |
| --- | --- | --- |
| Input | JSON file | Merged conversation JSON (training + identity) |
| Output | {input}_single.json | All conversations truncated to the first 2 turns (1 human + 1 gpt) |
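
A quick sanity check against this contract, assuming turns strictly alternate human/gpt (guaranteed for the hardcoded identity data; worth verifying for scraped ShareGPT conversations). `validate_entry` is a hypothetical helper, not part of FastChat:

```python
def validate_entry(entry):
    """Assert one conversation entry matches the alternating human/gpt schema."""
    turns = entry["conversations"]
    assert len(turns) >= 2 and len(turns) % 2 == 0, "expected paired turns"
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        assert turn["from"] == expected, f"turn {i}: expected {expected}"
```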

Dependencies

| Package | Purpose |
| --- | --- |
| json | Standard-library JSON serialization (all modules) |

All four modules rely only on the Python standard library; no external packages (e.g., numpy) are required.

Usage Examples

Full Pipeline Sequence (from prepare_all.py)

# Generate identity data
python3 -m fastchat.data.hardcoded_questions

# Merge with training split
python3 -m fastchat.data.merge \
    --in ~/datasets/sharegpt_20230521_4k_clean_lang_split_train.json hardcoded.json \
    --out ~/datasets/sharegpt_20230521_4k_clean_lang_split_identity.json

# Create GPT-4-only variant
python3 -m fastchat.data.extract_gpt4_only \
    --in ~/datasets/sharegpt_20230521_4k_clean_lang_split_identity.json

# Create single-round variant
python3 -m fastchat.data.extract_single_round \
    --in ~/datasets/sharegpt_20230521_4k_clean_lang_split_identity.json

This produces the following output files:

  • hardcoded.json -- ~793 identity Q&A entries
  • sharegpt_..._identity.json -- merged training + identity data
  • sharegpt_..._identity_gpt4.json -- GPT-4-only subset
  • sharegpt_..._identity_single.json -- single-round subset
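
One cheap consistency check after the merge step is that the merged file's entry count equals the sum of its inputs. `assert_merge_counts` is a hypothetical helper; paths are illustrative:

```python
import json

def assert_merge_counts(in_files, merged_file):
    """Verify the merged file contains exactly the entries of its inputs."""
    expected = 0
    for path in in_files:
        with open(path, "r", encoding="utf-8") as f:
            expected += len(json.load(f))
    with open(merged_file, "r", encoding="utf-8") as f:
        actual = len(json.load(f))
    assert actual == expected, f"expected {expected} entries, found {actual}"
```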

Programmatic Identity Data Generation

from fastchat.data.hardcoded_questions import identity_questions

identity_data = identity_questions()
print(f"Generated {len(identity_data)} identity Q&A pairs")
# Output: Generated 793 identity Q&A pairs

# Inspect a sample
sample = identity_data[0]
print(f"Q: {sample['conversations'][0]['value']}")
print(f"A: {sample['conversations'][1]['value']}")
# Q: Who are you?
# A: I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS).

Custom Identity

To train a model with a different identity, modify the name and org variables in hardcoded_questions.py:

# In hardcoded_questions.py, lines 13-14:
name = "Vicuna"                                          # Change this
org = "Large Model Systems Organization (LMSYS)"         # Change this
