Implementation:Lm sys FastChat Split Train Test

From Leeroopedia


Field | Value
Page Type | Implementation
Title | Split Train Test
Repository | lm-sys/FastChat
Knowledge Sources | Source Code Analysis, API Documentation
Domains | Data Preprocessing, NLP Pipeline, Model Evaluation
Last Updated | 2026-02-07 14:00 GMT

Overview

Split Train Test is the implementation module that divides the cleaned and validated ShareGPT conversation dataset into separate training and test JSON files. It uses NumPy's random permutation with a fixed seed (seed=0) for reproducible shuffling and supports a configurable split ratio.

Description

This module is implemented as an inline CLI script (no importable functions). It reads a JSON file of conversations, shuffles them deterministically using np.random.seed(0) and np.random.permutation, splits them at the configured ratio, and writes the two resulting subsets to separate JSON files with _train.json and _test.json suffixes.

The module is intentionally simple: the split logic is only a few lines of code. Its value lies in reproducibility (with the seed fixed at 0, the same input file yields the same split on every run) and in the standardized output naming convention.

Usage

CLI Invocation

python3 -m fastchat.data.split_train_test --in sharegpt_clean_lang_split.json --ratio 0.99

CLI Parameters

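Parameter | Type | Required | Default | Description
--in-file | str | Yes | -- | Path to input JSON file (format-validated conversations). The examples on this page abbreviate it to --in, which works if the script relies on argparse's default prefix matching.
--begin | int | No | 0 | Start index (not referenced by the split logic quoted under Inline Script Logic)
--end | int | No | 100 | End index (not referenced by the split logic quoted under Inline Script Logic)
--ratio | float | No | 0.9 | Training set fraction (pipeline uses 0.99)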
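
The parameter table can be mirrored with a small argparse parser. This is a sketch, not the script's actual source: it assumes the options are declared with argparse defaults, and build_parser is a hypothetical helper name. It also illustrates why --in can stand in for --in-file: argparse expands any unambiguous prefix of a long option.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper mirroring the documented CLI parameters."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-file", type=str, required=True)
    parser.add_argument("--begin", type=int, default=0)
    parser.add_argument("--end", type=int, default=100)
    parser.add_argument("--ratio", type=float, default=0.9)
    return parser


# "--in" is an unambiguous prefix of "--in-file", so argparse expands it.
args = build_parser().parse_args(["--in", "sharegpt.json", "--ratio", "0.99"])
print(args.in_file, args.ratio, args.begin, args.end)
```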

Programmatic Import

This module is an inline script: all of its logic sits inside the if __name__ == "__main__" block, so there is no function to import.

Code Reference

Source Location

Item | Location
Module | fastchat/data/split_train_test.py
Script Logic | Lines 12-34
Repository | github.com/lm-sys/FastChat

Inline Script Logic

# Core logic (lines 20-34):
content = json.load(open(args.in_file, "r"))
np.random.seed(0)                              # Fixed seed for reproducibility

perm = np.random.permutation(len(content))     # Random permutation of indices
content = [content[i] for i in perm]            # Shuffle content
split = int(args.ratio * len(content))          # Calculate split point

train_set = content[:split]                     # First portion = training
test_set = content[split:]                      # Remainder = test

# Output file names derived from input
train_name = args.in_file.replace(".json", "_train.json")
test_name = args.in_file.replace(".json", "_test.json")

json.dump(train_set, open(train_name, "w"), indent=2, ensure_ascii=False)
json.dump(test_set, open(test_name, "w"), indent=2, ensure_ascii=False)
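
The script logic above leaves its file handles to be closed by the interpreter. The following sketch wraps the same steps in a function that closes files explicitly and runs it on a toy input. The name split_file, the explicit utf-8 encoding, and the {"id": i} records are assumptions made for illustration; none of them come from FastChat.

```python
import json
import os
import tempfile

import numpy as np


def split_file(in_file, ratio=0.9):
    """Hypothetical wrapper: same steps as the script, with files closed explicitly."""
    with open(in_file, "r", encoding="utf-8") as f:  # encoding is an assumption
        content = json.load(f)

    np.random.seed(0)                           # fixed seed, as in the script
    perm = np.random.permutation(len(content))
    content = [content[i] for i in perm]
    split = int(ratio * len(content))

    train_name = in_file.replace(".json", "_train.json")
    test_name = in_file.replace(".json", "_test.json")
    for name, subset in ((train_name, content[:split]), (test_name, content[split:])):
        with open(name, "w", encoding="utf-8") as f:
            json.dump(subset, f, indent=2, ensure_ascii=False)
    return train_name, test_name


# Toy run in a temporary directory with 10 placeholder records.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"id": i} for i in range(10)], f)
    train_path, test_path = split_file(path, ratio=0.5)
    with open(train_path, encoding="utf-8") as f:
        train = json.load(f)
    with open(test_path, encoding="utf-8") as f:
        test = json.load(f)

print(len(train), len(test))
```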

Import

N/A -- This is an inline CLI script. There are no functions available for import. To replicate the logic programmatically, use the NumPy operations directly:

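import numpy as np

data = ["a", "b", "c", "d"]              # placeholder for the loaded conversation list
np.random.seed(0)                        # same fixed seed as the script
perm = np.random.permutation(len(data))  # shuffled indices
shuffled = [data[i] for i in perm]       # data reordered by the permutation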
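
For callers who want the split as a function, one possible shape is sketched below. split_conversations is a hypothetical name, not part of FastChat. It swaps the global np.random.seed call for a private np.random.RandomState(seed), which draws from the same legacy generator without changing NumPy's global random state; the final assertion checks that both approaches give the same ordering.

```python
import numpy as np


def split_conversations(data, ratio=0.9, seed=0):
    """Hypothetical helper: shuffle with a private RandomState, then slice."""
    rng = np.random.RandomState(seed)   # same legacy generator np.random.seed configures
    perm = rng.permutation(len(data))
    shuffled = [data[i] for i in perm]
    split = int(ratio * len(data))
    return shuffled[:split], shuffled[split:]


data = list(range(20))                  # placeholder records
train, test = split_conversations(data, ratio=0.5)

# The private generator reproduces the global-seed permutation.
np.random.seed(0)
global_perm = np.random.permutation(len(data))
assert [data[i] for i in global_perm] == train + test
```

Keeping the generator local means that calling the helper does not reset the random state of other code running in the same process.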

I/O Contract

Inputs

Input | Type | Description
in_file | JSON file | Format-validated ShareGPT JSON: a list of conversation dicts. This is typically the output of the format filtering step.

Outputs

Output | Type | Naming Convention | Description
train file | JSON file | {input_prefix}_train.json | Training subset: the first int(ratio * N) of the N shuffled conversations
test file | JSON file | {input_prefix}_test.json | Test subset: the remaining N - int(ratio * N) shuffled conversations

Example: With input sharegpt_clean_lang_split.json and --ratio 0.99:

  • sharegpt_clean_lang_split_train.json -- about 99% of conversations (the first int(0.99 * N) after shuffling)
  • sharegpt_clean_lang_split_test.json -- the remaining conversations (about 1%)
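
The naming convention comes from str.replace on the input path, which has two edge cases. The sketch below uses a hypothetical output_names helper to show them: a path without ".json" is left unchanged, so both output names would equal the input path, and every ".json" occurrence is rewritten, including one inside a directory name.

```python
def output_names(in_file):
    """Derive the output names with the same str.replace calls the script uses."""
    return (in_file.replace(".json", "_train.json"),
            in_file.replace(".json", "_test.json"))


# Normal case: the suffix is inserted before ".json".
print(output_names("sharegpt_clean_lang_split.json"))

# No ".json" in the path: nothing is replaced, so both names equal the input.
print(output_names("conversations.txt"))

# Every occurrence is replaced, including ones in directory names.
print(output_names("runs.json/data.json"))
```

In practice this means the input path should end in ".json" and contain that substring only once.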

Dependencies

Package | Purpose
numpy | Random seed, permutation for reproducible shuffling

Usage Examples

Pipeline Usage (from prepare_all.py)

python3 -m fastchat.data.split_train_test \
    --in ~/datasets/sharegpt_20230521_4k_clean_lang_split.json \
    --ratio 0.99

This produces:

  • ~/datasets/sharegpt_20230521_4k_clean_lang_split_train.json
  • ~/datasets/sharegpt_20230521_4k_clean_lang_split_test.json

Default Ratio (90/10)

python3 -m fastchat.data.split_train_test --in sharegpt_processed.json

Uses the default ratio of 0.9, producing a 90/10 train/test split.
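
Because the split point is computed with int(), which truncates, the realized proportions can differ slightly from the nominal ratio on small inputs. A minimal sketch of the arithmetic, using a hypothetical split_sizes helper:

```python
def split_sizes(n, ratio):
    """Return (train_size, test_size) for n conversations."""
    split = int(ratio * n)   # int() truncates toward zero
    return split, n - split


print(split_sizes(1000, 0.9))  # 900 train, 100 test
print(split_sizes(5, 0.9))     # 4.5 truncates to 4, so 4 train and 1 test
```

The test set always receives the remainder, so any fractional conversation lost to truncation ends up there.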

Verifying Reproducibility

import json
import numpy as np

# Load the conversation list (replace "data.json" with your input file)
with open("data.json") as f:
    content = json.load(f)

# Run 1
np.random.seed(0)
perm1 = np.random.permutation(len(content))

# Run 2
np.random.seed(0)
perm2 = np.random.permutation(len(content))

# Re-seeding with the same value reproduces the same permutation
assert np.array_equal(perm1, perm2)
