Implementation:Lm sys FastChat Split Train Test
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Split Train Test |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Model Evaluation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Split Train Test is the implementation module that divides the cleaned and validated ShareGPT conversation dataset into separate training and test JSON files. It uses NumPy's random permutation with a fixed seed (seed=0) for reproducible shuffling and supports a configurable split ratio.
Description
This module is implemented as an inline CLI script (no importable functions). It reads a JSON file of conversations, shuffles them deterministically using np.random.seed(0) and np.random.permutation, splits them at the configured ratio, and writes the two resulting subsets to separate JSON files with _train.json and _test.json suffixes.
The module is intentionally simple -- the split logic is only a few lines of code. Its value lies in the reproducibility guarantee provided by the fixed seed and the standardized output naming convention.
Usage
CLI Invocation
python3 -m fastchat.data.split_train_test --in sharegpt_clean_lang_split.json --ratio 0.99
CLI Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str | Yes | -- | Path to input JSON file (format-validated conversations). The examples on this page abbreviate it to --in, which argparse's default prefix matching resolves to --in-file. |
| --begin | int | No | 0 | Start index (not typically used in pipeline) |
| --end | int | No | 100 | End index (not typically used in pipeline) |
| --ratio | float | No | 0.9 | Training set fraction (pipeline uses 0.99) |
Programmatic Import
This module is an inline script and does not expose importable functions. There is no function to import; the logic is entirely within the if __name__ == "__main__" block.
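Because nothing is importable, callers who want the same split inside Python have to re-create it. The following is a minimal sketch, not part of FastChat: the function name `split_train_test` and its `seed` parameter are hypothetical, while the seeding, permutation, and `int(ratio * n)` split point mirror the script logic documented on this page.

```python
import numpy as np


def split_train_test(content, ratio=0.9, seed=0):
    """Hypothetical helper mirroring the script's shuffle-and-split steps."""
    # Seeding the global NumPy generator, as the script does, makes the
    # permutation identical on every call with the same length and seed.
    np.random.seed(seed)
    perm = np.random.permutation(len(content))
    shuffled = [content[i] for i in perm]

    # int() truncates, so the training set gets floor(ratio * n) items.
    split = int(ratio * len(shuffled))
    return shuffled[:split], shuffled[split:]


# Toy stand-in for a list of conversation dicts
conversations = [{"id": str(i)} for i in range(10)]
train_set, test_set = split_train_test(conversations, ratio=0.9)
print(len(train_set), len(test_set))
```

Note that `np.random.seed` resets NumPy's global random state, so calling a helper like this also affects any later `np.random` calls in the same process.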
Code Reference
Source Location
| Item | Location |
|---|---|
| Module | fastchat/data/split_train_test.py |
| Script Logic | Lines 12-34 |
| Repository | github.com/lm-sys/FastChat |
Inline Script Logic
# Core logic (lines 20-34):
content = json.load(open(args.in_file, "r"))
np.random.seed(0) # Fixed seed for reproducibility
perm = np.random.permutation(len(content)) # Random permutation of indices
content = [content[i] for i in perm] # Shuffle content
split = int(args.ratio * len(content)) # Calculate split point
train_set = content[:split] # First portion = training
test_set = content[split:] # Remainder = test
# Output file names derived from input
train_name = args.in_file.replace(".json", "_train.json")
test_name = args.in_file.replace(".json", "_test.json")
json.dump(train_set, open(train_name, "w"), indent=2, ensure_ascii=False)
json.dump(test_set, open(test_name, "w"), indent=2, ensure_ascii=False)
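The output names come from `str.replace`, which substitutes every occurrence of `.json` in the path and leaves the path unchanged when `.json` is absent. In the latter case both output names equal the input path, so the writes would land on the input file. The sketch below illustrates that behaviour and one possible suffix-based alternative; the helper `derive_output_names` is hypothetical and not part of FastChat.

```python
from pathlib import Path


def derive_output_names(in_file):
    """Hypothetical helper: build output names from the file stem only."""
    path = Path(in_file)
    if path.suffix != ".json":
        raise ValueError(f"expected a .json input file, got: {in_file}")
    train = path.with_name(path.stem + "_train.json")
    test = path.with_name(path.stem + "_test.json")
    return str(train), str(test)


# Behaviour of the script's str.replace approach on three inputs
normal = "sharegpt.json".replace(".json", "_train.json")
no_suffix = "sharegpt.txt".replace(".json", "_train.json")
nested = "runs.json/sharegpt.json".replace(".json", "_train.json")

print(normal)     # sharegpt_train.json
print(no_suffix)  # sharegpt.txt (unchanged, same as the input path)
print(nested)     # runs_train.json/sharegpt_train.json
print(derive_output_names("sharegpt.json"))
```

For the file names used in the pipeline, which end in `.json` and contain it only once, both approaches give the same result.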
Import
N/A -- This is an inline CLI script. There are no functions available for import. To replicate the logic programmatically, use the NumPy operations directly:
import json
import numpy as np

with open("data.json") as f:  # placeholder path
    data = json.load(f)
np.random.seed(0)
perm = np.random.permutation(len(data))
shuffled = [data[i] for i in perm]
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| in_file | JSON file | Format-validated ShareGPT JSON: a list of conversation dicts. This is typically the output of the format filtering step. |
Outputs
| Output | Type | Naming Convention | Description |
|---|---|---|---|
| train file | JSON file | {input_prefix}_train.json | Training subset: the first int(ratio * N) of the N shuffled conversations |
| test file | JSON file | {input_prefix}_test.json | Test subset: the remaining N - int(ratio * N) shuffled conversations |
Example: With input sharegpt_clean_lang_split.json and --ratio 0.99:
- sharegpt_clean_lang_split_train.json -- 99% of conversations
- sharegpt_clean_lang_split_test.json -- 1% of conversations
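Since the split point is `int(ratio * len(content))`, the stated percentages are approximate for small inputs: truncation rounds the training count down and the test set absorbs the remainder. A minimal sketch of that arithmetic, using a hypothetical helper `split_sizes` that is not part of the script:

```python
def split_sizes(n, ratio):
    """Hypothetical helper: sizes produced by split = int(ratio * n)."""
    split = int(ratio * n)  # truncates toward zero
    return split, n - split


for n in (10, 50, 1000):
    train, test = split_sizes(n, 0.99)
    print(f"n={n}: train={train}, test={test}")
```

With only 10 conversations, for instance, a ratio of 0.99 yields 9 training items and 1 test item, which is a 90/10 split in practice.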
Dependencies
| Package | Purpose |
|---|---|
| numpy | Random seed, permutation for reproducible shuffling |
Usage Examples
Pipeline Usage (from prepare_all.py)
python3 -m fastchat.data.split_train_test \
--in ~/datasets/sharegpt_20230521_4k_clean_lang_split.json \
--ratio 0.99
This produces:
- ~/datasets/sharegpt_20230521_4k_clean_lang_split_train.json
- ~/datasets/sharegpt_20230521_4k_clean_lang_split_test.json
Default Ratio (90/10)
python3 -m fastchat.data.split_train_test --in sharegpt_processed.json
Uses the default ratio of 0.9, producing a 90/10 train/test split.
Verifying Reproducibility
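import json
import numpy as np

# Load the dataset once; only its length matters for the permutation
with open("data.json") as f:
    content = json.load(f)

# Run 1
np.random.seed(0)
perm1 = np.random.permutation(len(content))

# Run 2: re-seeding restores the same generator state
np.random.seed(0)
perm2 = np.random.permutation(len(content))

# Same seed and same length always give the same permutation
assert np.array_equal(perm1, perm2)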
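Beyond comparing permutations, one can check that the two output lists together contain exactly the input conversations. A minimal sketch of such a check on already-loaded lists; the function `is_partition` is hypothetical, and it compares conversations by their canonical JSON serialisation, so it assumes no particular key such as an ID field.

```python
import json
from collections import Counter


def is_partition(full, train, test):
    """Hypothetical check: train + test hold the same items as the input."""
    def key(item):
        # Canonical JSON text makes dicts hashable and key-order-insensitive.
        return json.dumps(item, sort_keys=True, ensure_ascii=False)

    return Counter(map(key, full)) == Counter(map(key, train + test))


full = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
print(is_partition(full, [{"id": "c"}, {"id": "a"}], [{"id": "b"}]))
print(is_partition(full, [{"id": "c"}], [{"id": "b"}]))
```

The first call covers every item and the second drops one, so the check distinguishes a complete split from a lossy one.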
Related Pages
- Principle:Lm_sys_FastChat_Train_Test_Data_Splitting -- The principle that this implementation realizes
- Implementation:Lm_sys_FastChat_Filter_Wrong_Format -- Previous pipeline step: format validation
- Implementation:Lm_sys_FastChat_Hardcoded_Questions_And_Merge -- Next pipeline step: identity data injection