
Principle: lm-sys/FastChat Train Test Data Splitting

From Leeroopedia


Field Value
Page Type Principle
Title Train Test Data Splitting
Repository lm-sys/FastChat
Knowledge Sources Source Code Analysis, API Documentation
Domains Data Preprocessing, NLP Pipeline, Model Evaluation
Last Updated 2026-02-07 14:00 GMT

Overview

Train Test Data Splitting is a fundamental data preparation principle in the FastChat ShareGPT Data Pipeline that governs how the cleaned, language-filtered, and length-split conversation dataset is divided into training and test subsets. Keeping training and evaluation data strictly separate is essential for assessing model performance without data leakage.

Description

Random Shuffling with Fixed Seed

Before splitting, the entire dataset is randomly shuffled using a fixed random seed (seed=0). This ensures that:

  • The shuffle is reproducible: running the split multiple times on the same input always produces the same train/test partition.
  • The distribution of conversations across the split is uniform: no systematic ordering in the input data (e.g., conversations sorted by date or topic) biases the split.

The implementation uses NumPy's np.random.permutation to generate a uniformly random permutation of indices, an O(n) shuffle with well-studied statistical properties.
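
As a minimal sketch of the seeded shuffle (the helper name here is illustrative, not FastChat's actual API):

```python
import numpy as np

def shuffle_fixed_seed(records, seed=0):
    """Return the records in a reproducible random order.

    Seeding NumPy's global RNG before permuting makes the shuffle
    deterministic: the same input always yields the same ordering.
    """
    np.random.seed(seed)
    perm = np.random.permutation(len(records))
    return [records[i] for i in perm]
```

Calling this twice on the same list returns identical orderings, which is exactly the reproducibility property described above.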

Configurable Split Ratio

The split ratio determines what fraction of the data goes to training versus testing. The ratio parameter specifies the training fraction:

  • A ratio of 0.9 means 90% training, 10% test (the module's default)
  • A ratio of 0.99 means 99% training, 1% test (the pipeline's default, as used in prepare_all.py)
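
The arithmetic behind the ratio is a single cutoff index. A sketch (illustrative helper, not the module's actual internals):

```python
def train_size(n_records, ratio=0.9):
    # Records [0, n_train) of the shuffled order go to training;
    # the remaining n_records - n_train become the test set.
    return int(n_records * ratio)

# With 1000 conversations and the pipeline default ratio of 0.99,
# 990 conversations go to training and 10 to test.
```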

The high ratio (0.99) used in the Vicuna pipeline reflects the practical reality that:

  • ShareGPT conversations are precious training data, and maximizing the training set size is important for model quality
  • A 1% test set is sufficient for basic loss monitoring and sanity checks during fine-tuning
  • Comprehensive model evaluation typically uses separate benchmark datasets (e.g., MT-Bench) rather than a held-out split

Importance of Held-Out Test Data

Even with a small test fraction, maintaining a held-out set serves several critical purposes:

  • Overfitting detection: Comparing training loss to test loss reveals whether the model is memorizing training examples rather than learning generalizable patterns.
  • Hyperparameter validation: When tuning learning rate, batch size, or number of epochs, the test set provides an unbiased performance estimate.
  • Reproducibility: A fixed test set allows consistent comparison across different training runs and model configurations.

Output Naming Convention

The split produces two output files with names derived from the input file by replacing the .json extension with _train.json and _test.json. This naming convention makes it easy to identify paired train/test files and trace them back to their source.
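
A sketch of that filename derivation (illustrative helper, assuming the input path ends in .json):

```python
def derive_output_names(in_file):
    """Map data.json -> (data_train.json, data_test.json)."""
    assert in_file.endswith(".json"), "expects a .json input path"
    base = in_file[: -len(".json")]
    return base + "_train.json", base + "_test.json"
```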

Usage

In the standard FastChat pipeline, train/test splitting is the fifth step, applied after format validation:

python3 -m fastchat.data.split_train_test \
    --in sharegpt_clean_lang_split.json \
    --ratio 0.99

This produces:

  • sharegpt_clean_lang_split_train.json (99% of conversations)
  • sharegpt_clean_lang_split_test.json (1% of conversations)

Theoretical Basis

Train/test splitting is one of the most fundamental practices in machine learning, grounded in statistical learning theory:

  • Generalization error estimation: The test set provides an unbiased estimate of the model's performance on unseen data, which is the true measure of a model's utility. Without a test set, reported performance metrics are optimistically biased.
  • Fixed seed reproducibility: Using a deterministic seed (np.random.seed(0)) ensures that experiments are reproducible. This is critical for scientific rigor -- different researchers should be able to independently verify results using the same data split.
  • Random shuffling necessity: If the data has any inherent ordering (chronological, topical, or by source), a non-shuffled split would create systematically different distributions in the train and test sets, violating the i.i.d. (independent and identically distributed) assumption.
  • Split ratio trade-offs: Larger training sets improve model performance but reduce the reliability of test set estimates. The 99/1 split used in the Vicuna pipeline prioritizes training data volume, which is appropriate when external evaluation benchmarks are available for more rigorous assessment.
