Principle:Princeton nlp SimPO Dataset Loading and Mixing
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
A data loading pattern that combines multiple preference datasets at specified proportions into unified train/test splits.
Description
Preference optimization methods like SimPO require datasets of paired responses: a chosen (preferred) response and a rejected (dispreferred) response to the same prompt. Dataset loading and mixing addresses the practical need to combine multiple data sources at different proportions. For example, one might mix 100% of an ultrafeedback dataset with 50% of a custom dataset. The mixer loads each dataset from the Hugging Face Hub or local disk, subsamples the training split according to the specified fraction, and concatenates the results. Test splits are never subsampled, so evaluation stays comparable across different mixing proportions.
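As a rough illustration of these proportions, the sketch below writes a mixer as a mapping from dataset name to training fraction and computes how many training rows each source would contribute. The dataset names and row counts are hypothetical placeholders, and `rows_to_keep` (including its negative-fraction check) is an illustrative helper assumed here, not something taken from SimPO.

```python
# Hypothetical mixer: dataset name -> fraction of its training split to keep.
# Names and sizes are placeholders for illustration only.
dataset_mixer = {
    "example-org/ultrafeedback-preferences": 1.0,  # keep 100%
    "data/custom_preferences": 0.5,                # keep 50%
}

# Assumed training-split sizes (made-up numbers).
train_sizes = {
    "example-org/ultrafeedback-preferences": 60_000,
    "data/custom_preferences": 8_001,
}


def rows_to_keep(fraction: float, num_rows: int) -> int:
    """Number of training rows kept for one source, rounded down."""
    if fraction < 0:
        # Assumed sanity check: a negative fraction has no sensible meaning.
        raise ValueError("Dataset fractions cannot be negative.")
    return int(fraction * num_rows)


kept = {
    name: rows_to_keep(fraction, train_sizes[name])
    for name, fraction in dataset_mixer.items()
}
total_train_rows = sum(kept.values())
```

Because the row count is truncated with `int`, a fraction of 0.5 on an odd-sized split drops the extra half row instead of rounding up.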
Usage
Use this principle when preparing data for any preference optimization training run. It is the data ingestion step that precedes chat template application and tokenization. The dataset mixer pattern is especially useful when experimenting with different data compositions.
Theoretical Basis
The mixing algorithm follows a proportional sampling approach:
- For each dataset in the mixer, load the specified split
- For training: keep the first int(fraction * len(dataset)) examples
- For test: use the full dataset (no subsampling)
- Concatenate all subsampled training sets and all test sets
- Optionally shuffle with a fixed seed for reproducibility
Pseudo-code:
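# Abstract algorithm (NOT the real implementation)
train_datasets, test_datasets = [], []
for dataset_name, fraction in dataset_mixer.items():
    # Training split: keep the first `fraction` of the rows
    train_data = load(dataset_name, split="train")
    train_subset = train_data[:int(fraction * len(train_data))]
    train_datasets.append(train_subset)
    # Test split: keep every row
    test_datasets.append(load(dataset_name, split="test"))
combined_train = concatenate(train_datasets)
combined_train = shuffle(combined_train, seed=42)
combined_test = concatenate(test_datasets)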
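The pseudo-code above can be made concrete as a minimal, dependency-free sketch. Plain Python lists stand in for loaded datasets here. `mix_datasets`, the `sources` layout, `make_rows`, and the toy rows are all assumptions for illustration, not the SimPO implementation.

```python
import random


def mix_datasets(sources, dataset_mixer, shuffle=True, seed=42):
    """Mix in-memory preference datasets.

    sources: name -> {"train": [rows], "test": [rows]} (stand-in for loading)
    dataset_mixer: name -> fraction of the training split to keep
    """
    train_rows, test_rows = [], []
    for name, fraction in dataset_mixer.items():
        if fraction < 0:
            raise ValueError(f"Negative fraction for {name!r}")
        train = sources[name]["train"]
        # Training split: keep the first int(fraction * len) rows.
        train_rows.extend(train[: int(fraction * len(train))])
        # Test split: always kept whole.
        test_rows.extend(sources[name]["test"])
    if shuffle:
        # A seeded generator makes the shuffled order reproducible.
        random.Random(seed).shuffle(train_rows)
    return {"train": train_rows, "test": test_rows}


def make_rows(tag, n):
    """Build n toy preference rows with prompt/chosen/rejected fields."""
    return [
        {"prompt": f"{tag}-q{i}", "chosen": f"{tag}-good{i}", "rejected": f"{tag}-bad{i}"}
        for i in range(n)
    ]


sources = {
    "a": {"train": make_rows("a", 10), "test": make_rows("a", 4)},
    "b": {"train": make_rows("b", 10), "test": make_rows("b", 4)},
}
mixed = mix_datasets(sources, {"a": 1.0, "b": 0.5})
```

With the Hugging Face `datasets` library, the equivalent operations would likely be `Dataset.select(range(n))` for the truncation, `concatenate_datasets` for the merge, and `Dataset.shuffle(seed=...)` for the seeded shuffle.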