
Principle:Princeton nlp SimPO Dataset Loading and Mixing

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-08 04:30 GMT

Overview

A data loading pattern that combines multiple preference datasets at specified proportions into unified train/test splits.

Description

Preference optimization methods like SimPO require datasets of paired responses: a chosen (preferred) response and a rejected (dispreferred) response to the same prompt. Dataset loading and mixing addresses the practical need to combine multiple data sources at different proportions. For example, one might mix 100% of an ultrafeedback dataset with 50% of a custom dataset. The mixer loads each dataset from the Hugging Face Hub or local disk, subsamples the training split according to the specified fraction, and concatenates the results. Test splits are never subsampled, so evaluation stays comparable across different mixing fractions.
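To make the proportions concrete, the following is a minimal sketch of what a preference record and a mixer configuration could look like, together with the resulting training-set size. The dataset names, the field names `prompt`/`chosen`/`rejected`, the split sizes, and the helper `expected_train_size` are all hypothetical and chosen only for illustration; they are not taken from the SimPO codebase.

```python
# Hypothetical preference record: one prompt with a preferred and a
# dispreferred response (field names are an assumption).
example_record = {
    "prompt": "Explain what a learning rate is.",
    "chosen": "It controls how large each parameter update is.",
    "rejected": "It is a kind of dataset.",
}

# Hypothetical mixer config: dataset name -> fraction of the train split to keep.
dataset_mixer = {
    "org/ultrafeedback-preferences": 1.0,  # keep 100%
    "org/custom-preferences": 0.5,         # keep 50%
}

# Made-up training split sizes, used only for the arithmetic below.
train_sizes = {
    "org/ultrafeedback-preferences": 60_000,
    "org/custom-preferences": 8_000,
}


def expected_train_size(mixer, sizes):
    """Sum of int(fraction * size) over all datasets in the mixer."""
    return sum(int(frac * sizes[name]) for name, frac in mixer.items())


print(expected_train_size(dataset_mixer, train_sizes))  # 60000 + 4000 = 64000
```

Under these assumed sizes the mixed training set holds 64,000 examples, while the combined test set would contain every test example from both sources.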

Usage

Use this principle when preparing data for any preference optimization training run. It is the data ingestion step that precedes chat template application and tokenization. The dataset mixer pattern is especially useful when experimenting with different data compositions.

Theoretical Basis

The mixing algorithm follows a proportional sampling approach:

  1. For each dataset in the mixer, load the requested splits (train and/or test)
  2. For training: subsample to int(frac * len(dataset)) examples
  3. For test: use the full dataset (no subsampling)
  4. Concatenate the subsampled training sets into one training set, and the test sets into one test set
  5. Optionally shuffle the combined training set with a fixed seed for reproducibility

Pseudo-code:

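# Abstract algorithm (NOT real implementation)
train_datasets, test_datasets = [], []
for dataset_name, fraction in dataset_mixer.items():
    # Training split: keep only the first `fraction` of the examples
    train_data = load(dataset_name, split="train")
    train_subset = train_data[:int(fraction * len(train_data))]
    train_datasets.append(train_subset)
    # Test split: always kept in full
    test_datasets.append(load(dataset_name, split="test"))

combined_train = concatenate(train_datasets)
combined_train = shuffle(combined_train, seed=42)  # optional; fixed seed for reproducibility
combined_test = concatenate(test_datasets)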
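
The abstract algorithm can be exercised end to end with a small, self-contained sketch that uses plain Python lists in place of real datasets. The function name `mix_datasets`, the `sources` layout, and the range check on fractions are assumptions made for this example and are not the SimPO implementation. With the Hugging Face `datasets` library, the same steps would typically be expressed with `load_dataset`, `Dataset.select`, `concatenate_datasets`, and `Dataset.shuffle(seed=...)`.

```python
import random


def mix_datasets(sources, dataset_mixer, shuffle=True, seed=42):
    """Mix in-memory datasets (hypothetical helper, lists stand in for datasets).

    sources:       name -> {"train": [...], "test": [...]}
    dataset_mixer: name -> fraction of the train split to keep
    """
    train_parts, test_parts = [], []
    for name, frac in dataset_mixer.items():
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"Fraction for {name!r} must be in [0, 1], got {frac}")
        splits = sources[name]
        train = splits.get("train", [])
        # Training split: keep the first int(frac * len) examples.
        train_parts.extend(train[: int(frac * len(train))])
        # Test split: never subsampled.
        test_parts.extend(splits.get("test", []))
    if shuffle:
        # A dedicated Random instance keeps the shuffle reproducible.
        random.Random(seed).shuffle(train_parts)
    return {"train": train_parts, "test": test_parts}


# Made-up toy data for illustration.
sources = {
    "a": {"train": list(range(10)), "test": ["ta1", "ta2"]},
    "b": {"train": list(range(100, 108)), "test": ["tb1"]},
}
mixed = mix_datasets(sources, {"a": 1.0, "b": 0.5})
print(len(mixed["train"]), len(mixed["test"]))  # 14 3
```

Because the sketch keeps a prefix of each training split before shuffling, the same seed and the same mixer always yield the same combined training set.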

Related Pages

Implemented By
