Implementation:Lm sys FastChat Split Train Test

Field	Value
Page Type	Implementation
Title	Split Train Test
Repository	lm-sys/FastChat
Knowledge Sources	Source Code Analysis, API Documentation
Domains	Data Preprocessing, NLP Pipeline, Model Evaluation
Last Updated	2026-02-07 14:00 GMT

Overview

Split Train Test is the implementation module that divides the cleaned and validated ShareGPT conversation dataset into separate training and test JSON files. It uses NumPy's random permutation with a fixed seed (seed=0) for reproducible shuffling and supports a configurable split ratio.

Description

This module is implemented as an inline CLI script (no importable functions). It reads a JSON file of conversations, shuffles them deterministically using np.random.seed(0) and np.random.permutation, splits them at the configured ratio, and writes the two resulting subsets to separate JSON files with _train.json and _test.json suffixes.

The module is intentionally simple -- the split logic is only a few lines of code. Its value lies in the reproducibility guarantee provided by the fixed seed and the standardized output naming convention.

Usage

CLI Invocation

python3 -m fastchat.data.split_train_test --in sharegpt_clean_lang_split.json --ratio 0.99

CLI Parameters

Parameter	Type	Required	Default	Description
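`--in-file`	str	Yes	--	Path to input JSON file (format-validated conversations); the examples on this page abbreviate it to `--in`, which argparse resolves as an unambiguous prefix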
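`--begin`	int	No	`0`	Start index (accepted by the parser; not referenced by the split logic shown under Inline Script Logic)
`--end`	int	No	`100`	End index (accepted by the parser; not referenced by the split logic shown under Inline Script Logic)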
`--ratio`	float	No	`0.9`	Training set fraction (pipeline uses `0.99`)
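
The invocation above passes --in while the table lists --in-file. Python's argparse expands unambiguous prefixes of long options by default, which is why both spellings reach the same in_file attribute. A minimal sketch that rebuilds a parser from the table; build_parser is a hypothetical name used only for illustration, not a FastChat function:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: mirrors the parameter table on this page.
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-file", type=str, required=True)
    parser.add_argument("--begin", type=int, default=0)
    parser.add_argument("--end", type=int, default=100)
    parser.add_argument("--ratio", type=float, default=0.9)
    return parser


# "--in" is an unambiguous prefix of "--in-file", so argparse expands it.
args = build_parser().parse_args(["--in", "sharegpt.json", "--ratio", "0.99"])
print(args.in_file, args.ratio, args.begin, args.end)
```

Spelling out --in-file in scripts is the safer choice, since prefix matching would stop working if another option starting with --in were ever added.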

Programmatic Import

This module is an inline script: all of its logic sits inside the if __name__ == "__main__" block, so there is no function to import.

Code Reference

Source Location

Item	Location
Module	`fastchat/data/split_train_test.py`
Script Logic	Lines 12-34
Repository	github.com/lm-sys/FastChat

Inline Script Logic

# Core logic (lines 20-34):
content = json.load(open(args.in_file, "r"))
np.random.seed(0)                              # Fixed seed for reproducibility

perm = np.random.permutation(len(content))     # Random permutation of indices
content = [content[i] for i in perm]            # Shuffle content
split = int(args.ratio * len(content))          # Calculate split point

train_set = content[:split]                     # First portion = training
test_set = content[split:]                      # Remainder = test

# Output file names derived from input
train_name = args.in_file.replace(".json", "_train.json")
test_name = args.in_file.replace(".json", "_test.json")

json.dump(train_set, open(train_name, "w"), indent=2, ensure_ascii=False)
json.dump(test_set, open(test_name, "w"), indent=2, ensure_ascii=False)

Import

N/A -- This is an inline CLI script. There are no functions available for import. To replicate the logic programmatically, use the NumPy operations directly:

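import numpy as np

data = ["a", "b", "c", "d"]              # placeholder; in practice, the loaded list of conversations
np.random.seed(0)                        # same fixed seed as the script
perm = np.random.permutation(len(data))  # shuffled index order
shuffled = [data[i] for i in perm]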
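
For callers who want the whole split rather than just the shuffle, the same steps can be wrapped in a small function. The sketch below is an illustration under stated assumptions: split_conversations and its seed parameter are hypothetical names, and the toy data list stands in for real conversation dicts.

```python
import numpy as np


def split_conversations(data, ratio=0.9, seed=0):
    """Hypothetical helper: shuffle with a fixed seed, then cut at int(ratio * n)."""
    np.random.seed(seed)                      # resets NumPy's global RNG state
    perm = np.random.permutation(len(data))   # shuffled index order
    shuffled = [data[i] for i in perm]
    split = int(ratio * len(shuffled))        # int() truncates toward zero
    return shuffled[:split], shuffled[split:]


# Toy stand-in for a list of conversation dicts.
data = [{"id": f"conv_{i}"} for i in range(10)]
train_set, test_set = split_conversations(data, ratio=0.9)
print(len(train_set), len(test_set))
```

Because the function reseeds on every call, repeated calls with the same input return the same two lists, which is the property the script relies on.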

I/O Contract

Inputs

Input	Type	Description
in_file	JSON file	Format-validated ShareGPT JSON: a list of conversation dicts. This is typically the output of the format filtering step.

Outputs

Output	Type	Naming Convention	Description
train file	JSON file	`{input_prefix}_train.json`	Training subset: the first `int(ratio * N)` of the `N` shuffled conversations
test file	JSON file	`{input_prefix}_test.json`	Test subset: the remaining `N - int(ratio * N)` shuffled conversations

Example: With input sharegpt_clean_lang_split.json and --ratio 0.99:

sharegpt_clean_lang_split_train.json -- 99% of conversations
sharegpt_clean_lang_split_test.json -- 1% of conversations
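
The output names come from a plain str.replace on the input path, which has two consequences worth knowing: every occurrence of ".json" in the path is rewritten, and a path that contains no ".json" comes back unchanged. A small sketch of that behavior; derive_output_names is a hypothetical wrapper around the two replace calls, not a function in the module:

```python
from typing import Tuple


def derive_output_names(in_file: str) -> Tuple[str, str]:
    # Hypothetical helper wrapping the two str.replace calls from the script.
    train_name = in_file.replace(".json", "_train.json")
    test_name = in_file.replace(".json", "_test.json")
    return train_name, test_name


print(derive_output_names("sharegpt_clean_lang_split.json"))

# Without ".json" in the path, both names equal the input path, so the two
# outputs would collide with each other and with the input file.
print(derive_output_names("conversations.txt"))
```

Keeping input files on a single .json extension, with no ".json" elsewhere in the path, avoids both surprises.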

Dependencies

Package	Purpose
numpy	Random seed, permutation for reproducible shuffling

Usage Examples

Pipeline Usage (from prepare_all.py)

python3 -m fastchat.data.split_train_test \
    --in ~/datasets/sharegpt_20230521_4k_clean_lang_split.json \
    --ratio 0.99

This produces:

~/datasets/sharegpt_20230521_4k_clean_lang_split_train.json
~/datasets/sharegpt_20230521_4k_clean_lang_split_test.json

Default Ratio (90/10)

python3 -m fastchat.data.split_train_test --in sharegpt_processed.json

Uses the default ratio of 0.9, producing a 90/10 train/test split.
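
The split point is int(ratio * N), so the training count is truncated and the test set receives whatever remains. A short sketch of the arithmetic for a few dataset sizes; split_sizes is a hypothetical helper and the sizes are made-up illustrations, not measurements from any real dataset:

```python
def split_sizes(n: int, ratio: float = 0.9):
    # Hypothetical helper: train gets int(ratio * n), test gets the remainder.
    n_train = int(ratio * n)
    return n_train, n - n_train


for n, ratio in [(1000, 0.9), (50, 0.99), (10, 0.99)]:
    print(n, ratio, split_sizes(n, ratio))
```

On small inputs the truncation makes the test share larger than the nominal 1 - ratio: 10 conversations at ratio 0.99 still leave one conversation in the test set.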

Verifying Reproducibility

import json
import numpy as np

# Run 1
content = json.load(open("data.json"))
np.random.seed(0)
perm1 = np.random.permutation(len(content))

# Run 2
np.random.seed(0)
perm2 = np.random.permutation(len(content))

# Same seed and same length give the same permutation
assert np.array_equal(perm1, perm2)
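
np.random.seed(0) resets NumPy's process-wide legacy generator, which can affect other code that draws from it. One way to get a seeded permutation without touching that global state is a private np.random.RandomState instance. This is an alternative to what the script does, not the script's own method. The sketch checks on the installed NumPy build whether both routes give the same order rather than assuming it; n is an arbitrary stand-in for len(content).

```python
import numpy as np

n = 20  # stand-in for len(content)

# Route used by the script: seed the global legacy generator.
np.random.seed(0)
perm_global = np.random.permutation(n)

# Alternative: a private generator with the same seed; global state untouched.
perm_local = np.random.RandomState(0).permutation(n)

same_order = np.array_equal(perm_global, perm_local)
print("identical permutations:", same_order)
```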

Related Pages

Principle:Lm_sys_FastChat_Train_Test_Data_Splitting -- The principle that this implementation realizes
Implementation:Lm_sys_FastChat_Filter_Wrong_Format -- Previous pipeline step: format validation
Implementation:Lm_sys_FastChat_Hardcoded_Questions_And_Merge -- Next pipeline step: identity data injection
