Implementation:Princeton nlp SimPO Get Datasets

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-08 04:30 GMT

Overview

Concrete tool for loading and mixing preference datasets with proportional sampling, provided by the SimPO alignment package.

Description

The get_datasets function is the main entry point for data loading. It extracts the dataset_mixer dictionary from DataArguments and delegates to mix_datasets, which handles the actual loading, subsampling, and concatenation. Each dataset is loaded either from the HuggingFace Hub (via load_dataset) or from local disk (via load_from_disk). Columns not listed in columns_to_keep are removed before concatenation to avoid schema conflicts between datasets from different sources.
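The mixing behaviour described above can be illustrated with a minimal sketch in plain Python. It is not the package's implementation: mix_rows, the toy sources dictionary, and the "leading fraction of rows" subsampling rule are assumptions made for illustration, and lists of dicts stand in for HuggingFace Dataset objects.

```python
import random
from typing import Dict, List

Row = Dict[str, str]


def mix_rows(
    sources: Dict[str, List[Row]],
    mixer: Dict[str, float],
    columns_to_keep: List[str],
    shuffle: bool = True,
    seed: int = 42,
) -> List[Row]:
    """Hypothetical stand-in for the train-split mixing step."""
    mixed: List[Row] = []
    for name, frac in mixer.items():
        rows = sources[name]
        # Drop columns outside columns_to_keep so every source shares one schema.
        trimmed = [
            {k: v for k, v in row.items() if k in columns_to_keep} for row in rows
        ]
        # Keep the leading fraction of rows (assumed subsampling rule).
        mixed.extend(trimmed[: int(frac * len(trimmed))])
    if shuffle:
        # Fixed seed so the mixed order is reproducible.
        random.Random(seed).shuffle(mixed)
    return mixed


# Two toy sources with different extra columns ("source" vs "score").
sources = {
    "a": [{"prompt": f"a{i}", "chosen": "x", "rejected": "y", "source": "a"} for i in range(4)],
    "b": [{"prompt": f"b{i}", "chosen": "x", "rejected": "y", "score": "1"} for i in range(4)],
}
train = mix_rows(sources, {"a": 1.0, "b": 0.5}, ["prompt", "chosen", "rejected"])
print(len(train))  # 4 rows from "a" plus 2 rows from "b"
```

Without the column filter, the two toy sources would disagree on their extra fields, which is the kind of schema conflict the column removal is meant to prevent.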

Usage

Import get_datasets and call it once the configuration has been parsed. This is Step 3 of the SimPO training pipeline, immediately after configuration parsing and before chat template application.

Code Reference

Source Location

  • Repository: SimPO
  • File: alignment/data.py (Lines 125-256)

Signature

def get_datasets(
    data_config: DataArguments | dict,
    splits: Optional[List[str]] = None,
    configs: Optional[List[str]] = None,
    columns_to_keep: Optional[List[str]] = None,
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads one or more datasets with varying training set proportions.

    Args:
        data_config: Dataset configuration and split proportions.
        splits: Dataset splits to load and mix (default: ['train', 'test']).
        configs: List of dataset config names.
        columns_to_keep: Column names to keep in the dataset.
        shuffle: Whether to shuffle the data.

    Returns:
        DatasetDict with combined train and test splits.
    """

def mix_datasets(
    dataset_mixer: dict,
    splits: Optional[List[str]] = None,
    configs: Optional[List[str]] = None,
    columns_to_keep: Optional[List[str]] = None,
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads and mixes datasets according to proportions
    specified in dataset_mixer.
    """

Import

from alignment import get_datasets

I/O Contract

Inputs

Name | Type | Required | Description
data_config | DataArguments or dict | Yes | Contains dataset_mixer, mapping dataset names to proportions
splits | List[str] | No | Splits to load (default: ["train", "test"])
configs | List[str] | No | Dataset config names (must match length of dataset_mixer)
columns_to_keep | List[str] | No | Columns to retain (others removed to avoid schema conflicts)
shuffle | bool | No | Whether to shuffle results (default: True, seed=42)
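The length constraint on configs can be checked before any data is loaded. The following is a hedged sketch only: check_mixer_inputs is a hypothetical helper, not part of the alignment package, and the rejection of empty mixers and negative proportions is an assumption about sensible input rather than documented behaviour.

```python
from typing import Dict, List, Optional


def check_mixer_inputs(
    dataset_mixer: Dict[str, float],
    configs: Optional[List[str]] = None,
) -> None:
    """Hypothetical pre-flight check for the inputs listed in the table."""
    if not dataset_mixer:
        raise ValueError("dataset_mixer must name at least one dataset.")
    # One config name per dataset, as the configs row requires.
    if configs is not None and len(configs) != len(dataset_mixer):
        raise ValueError(
            f"Got {len(configs)} config names for {len(dataset_mixer)} datasets."
        )
    # Assumption: a negative proportion has no sensible meaning.
    for name, frac in dataset_mixer.items():
        if frac < 0:
            raise ValueError(f"Proportion for {name!r} cannot be negative: {frac}")


# Passes silently: two datasets, two config names.
check_mixer_inputs({"org/data-a": 1.0, "org/data-b": 0.5}, ["default", "default"])
```

Failing early in this way reports a configuration mistake before time is spent downloading or reading datasets.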

Outputs

Name | Type | Description
raw_datasets | DatasetDict | Dictionary with "train" and "test" splits, each a HuggingFace Dataset

Usage Examples

Loading SimPO Training Data

from alignment import get_datasets, DataArguments

# Configure the dataset mixer
data_args = DataArguments(
    dataset_mixer={"princeton-nlp/llama3-ultrafeedback": 1.0},
    dataset_splits=["train_prefs", "test_prefs"],
)

# Load and mix datasets
raw_datasets = get_datasets(
    data_args,
    splits=data_args.dataset_splits,
    columns_to_keep=["messages", "chosen", "rejected", "prompt", "completion", "label"],
)

print(raw_datasets)
# DatasetDict({
#     train: Dataset({features: ['chosen', 'rejected', 'prompt', ...], num_rows: ...})
#     test: Dataset({features: ['chosen', 'rejected', 'prompt', ...], num_rows: ...})
# })

Multi-Dataset Mixing

from alignment import DataArguments, get_datasets

# Mix two datasets at different proportions
data_args = DataArguments(
    dataset_mixer={
        "princeton-nlp/llama3-ultrafeedback": 1.0,   # Use 100% of training set
        "another-org/preference-data": 0.5,           # Use 50% of training set
    },
)

# splits is omitted, so the default ["train", "test"] applies
raw_datasets = get_datasets(
    data_args,
    columns_to_keep=["chosen", "rejected", "prompt"],
)

Related Pages

Implements Principle
