Principle:Lm_sys_FastChat_Train_Test_Data_Splitting
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Train Test Data Splitting |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Model Evaluation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Train Test Data Splitting is a fundamental data preparation principle in the FastChat ShareGPT Data Pipeline that governs how the cleaned, filtered, and split conversation dataset is divided into training and test subsets. Proper separation of training and evaluation data is essential for assessing model performance without data leakage.
Description
Random Shuffling with Fixed Seed
Before splitting, the entire dataset is randomly shuffled using a fixed random seed (seed=0). This ensures that:
- The shuffle is reproducible: running the split multiple times on the same input always produces the same train/test partition.
- The distribution of conversations across the split is uniform: no systematic ordering in the input data (e.g., conversations sorted by date or topic) biases the split.
The implementation uses NumPy's np.random.permutation to generate a random permutation of indices, which provides efficient O(n) shuffling with well-studied statistical properties.
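A minimal sketch of this shuffle-then-split approach (the function name and signature here are illustrative, not FastChat's exact code):

```python
import numpy as np

def split_train_test(samples, ratio=0.9, seed=0):
    """Shuffle with a fixed seed, then split by the training fraction."""
    np.random.seed(seed)                        # fixed seed -> reproducible shuffle
    perm = np.random.permutation(len(samples))  # O(n) random permutation of indices
    n_train = int(len(samples) * ratio)
    train = [samples[i] for i in perm[:n_train]]
    test = [samples[i] for i in perm[n_train:]]
    return train, test
```

Because the seed is fixed, calling the function twice on the same input yields the identical partition, which is exactly the reproducibility property described above.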
Configurable Split Ratio
The split ratio determines what fraction of the data goes to training versus testing. The ratio parameter specifies the training fraction:
- A ratio of 0.9 means 90% training, 10% test (the module's default)
- A ratio of 0.99 means 99% training, 1% test (the pipeline's default, as used in prepare_all.py)
The high ratio (0.99) used in the Vicuna pipeline reflects the practical reality that:
- ShareGPT conversations are precious training data, and maximizing the training set size is important for model quality
- A 1% test set is sufficient for basic loss monitoring and sanity checks during fine-tuning
- Comprehensive model evaluation typically uses separate benchmark datasets (e.g., MT-Bench) rather than a held-out split
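For concreteness, the two ratios can be compared on a hypothetical dataset of 90,000 conversations (the dataset size and helper function here are illustrative):

```python
def split_sizes(n, ratio):
    """Train/test example counts for n examples and a given training fraction."""
    n_train = int(n * ratio)  # training share, truncated to a whole count
    return n_train, n - n_train

# Hypothetical dataset of 90,000 conversations:
print(split_sizes(90_000, 0.9))   # module default
print(split_sizes(90_000, 0.99))  # pipeline default: only ~1% held out
```

With the 0.99 ratio, only a few hundred conversations are held out, which is enough for loss monitoring but not for benchmark-grade evaluation.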
Importance of Held-Out Test Data
Even with a small test fraction, maintaining a held-out set serves several critical purposes:
- Overfitting detection: Comparing training loss to test loss reveals whether the model is memorizing training examples rather than learning generalizable patterns.
- Hyperparameter validation: When tuning learning rate, batch size, or number of epochs, the test set provides an unbiased performance estimate.
- Reproducibility: A fixed test set allows consistent comparison across different training runs and model configurations.
Output Naming Convention
The split produces two output files with names derived from the input file by replacing the .json extension with _train.json and _test.json. This naming convention makes it easy to identify paired train/test files and trace them back to their source.
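The naming rule can be sketched as follows (a simplified illustration, not the module's exact code):

```python
def output_paths(in_file):
    """Derive the paired train/test output names from an input .json path."""
    prefix = in_file.removesuffix(".json")  # strip the extension if present
    return prefix + "_train.json", prefix + "_test.json"

print(output_paths("sharegpt_clean_lang_split.json"))
```

This keeps the shared prefix intact, so both outputs are immediately traceable to their source file.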
Usage
In the standard FastChat pipeline, train/test splitting is the fifth step, applied after format validation:
python3 -m fastchat.data.split_train_test \
--in sharegpt_clean_lang_split.json \
--ratio 0.99
This produces:
- sharegpt_clean_lang_split_train.json (99% of conversations)
- sharegpt_clean_lang_split_test.json (1% of conversations)
Theoretical Basis
Train/test splitting is one of the most fundamental practices in machine learning, grounded in statistical learning theory:
- Generalization error estimation: The test set provides an unbiased estimate of the model's performance on unseen data, which is the true measure of a model's utility. Without a test set, reported performance metrics are optimistically biased.
- Fixed seed reproducibility: Using a deterministic seed (np.random.seed(0)) ensures that experiments are reproducible. This is critical for scientific rigor -- different researchers should be able to independently verify results using the same data split.
- Random shuffling necessity: If the data has any inherent ordering (chronological, topical, or by source), a non-shuffled split would create systematically different distributions in the train and test sets, violating the i.i.d. (independent and identically distributed) assumption.
- Split ratio trade-offs: Larger training sets improve model performance but reduce the reliability of test set estimates. The 99/1 split used in the Vicuna pipeline prioritizes training data volume, which is appropriate when external evaluation benchmarks are available for more rigorous assessment.
Related Pages
- Implementation:Lm_sys_FastChat_Split_Train_Test -- The implementation that realizes this principle
- Principle:Lm_sys_FastChat_Conversation_Format_Validation -- Previous pipeline stage: format validation
- Principle:Lm_sys_FastChat_Identity_Data_Injection -- Next pipeline stage: identity data injection and merging