Implementation:Princeton nlp SimPO Get Datasets
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
Concrete tool for loading and mixing preference datasets with proportional sampling, provided by the SimPO alignment package.
Description
The get_datasets function is the main entry point for data loading. It obtains the dataset_mixer dictionary from data_config (a DataArguments instance or a plain dict) and delegates to mix_datasets, which handles the actual loading, subsampling, and concatenation. Each dataset is loaded from the Hugging Face Hub (via load_dataset) or from local disk (via load_from_disk). Columns not listed in columns_to_keep are removed so that datasets from different sources share one schema and can be concatenated without conflicts.
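To illustrate why the column removal matters, here is a minimal pure-Python sketch. Plain lists of dicts stand in for Hugging Face Dataset objects, and `prune_columns` and `concat_rows` are hypothetical helpers written for this example; they are not part of the alignment package.

```python
from typing import Dict, List

Row = Dict[str, object]


def prune_columns(rows: List[Row], columns_to_keep: List[str]) -> List[Row]:
    """Drop every column that is not listed in columns_to_keep."""
    return [{k: v for k, v in row.items() if k in columns_to_keep} for row in rows]


def concat_rows(datasets: List[List[Row]]) -> List[Row]:
    """Concatenate datasets, refusing to mix rows with different column sets."""
    merged: List[Row] = []
    schema = None
    for rows in datasets:
        for row in rows:
            if schema is None:
                schema = set(row)
            elif set(row) != schema:
                raise ValueError(f"schema mismatch: {sorted(row)} vs {sorted(schema)}")
            merged.append(row)
    return merged


# Two toy sources that share the preference columns but differ in their extras.
source_a = [{"prompt": "p1", "chosen": "c1", "rejected": "r1", "score": 0.9}]
source_b = [{"prompt": "p2", "chosen": "c2", "rejected": "r2", "annotator": "x"}]

keep = ["prompt", "chosen", "rejected"]
mixed = concat_rows([prune_columns(source_a, keep), prune_columns(source_b, keep)])
```

Without the pruning step, `concat_rows([source_a, source_b])` raises a `ValueError` in this sketch, which mirrors the kind of schema conflict the description refers to.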
Usage
Import get_datasets and call it once the configuration has been parsed. This is Step 3 of the SimPO training pipeline: it runs immediately after configuration parsing and before the chat template is applied.
Code Reference
Source Location
- Repository: SimPO
- File: alignment/data.py (Lines 125-256)
Signature
def get_datasets(
    data_config: DataArguments | dict,
    splits: Optional[List[str]] = None,
    configs: Optional[List[str]] = None,
    columns_to_keep: Optional[List[str]] = None,
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads one or more datasets with varying training set proportions.

    Args:
        data_config: Dataset configuration and split proportions.
        splits: Dataset splits to load and mix (default: ['train', 'test']).
        configs: List of dataset config names.
        columns_to_keep: Column names to keep in the dataset.
        shuffle: Whether to shuffle the data.

    Returns:
        DatasetDict with combined train and test splits.
    """

def mix_datasets(
    dataset_mixer: dict,
    splits: Optional[List[str]] = None,
    configs: Optional[List[str]] = None,
    columns_to_keep: Optional[List[str]] = None,
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads and mixes datasets according to proportions
    specified in dataset_mixer.
    """
Import
from alignment import get_datasets
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_config | DataArguments or dict | Yes | Contains dataset_mixer mapping dataset names to proportions |
| splits | List[str] | No | Splits to load (default: ["train", "test"]) |
| configs | List[str] | No | Dataset config names (must match length of dataset_mixer) |
| columns_to_keep | List[str] | No | Columns to retain (others removed to avoid schema conflicts) |
| shuffle | bool | No | Whether to shuffle results (default: True, seed=42) |
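The input rules in the table above can be sketched as a small normalization step. This is an illustration under stated assumptions, not the package's code: `ToyDataArguments` and `resolve_inputs` are hypothetical names, a plain dict is assumed to be the name-to-proportion mapping itself, and an omitted `configs` is assumed to mean "default config for every dataset".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class ToyDataArguments:
    """Hypothetical stand-in for DataArguments; only dataset_mixer is modeled."""

    dataset_mixer: Dict[str, float] = field(default_factory=dict)


def resolve_inputs(
    data_config: Union[ToyDataArguments, dict],
    splits: Optional[List[str]] = None,
    configs: Optional[List[str]] = None,
) -> Tuple[Dict[str, float], List[str], List[Optional[str]]]:
    """Normalize the arguments described in the Inputs table."""
    if isinstance(data_config, ToyDataArguments):
        mixer = data_config.dataset_mixer
    elif isinstance(data_config, dict):
        # Assumption: a plain dict is itself the name -> proportion mapping.
        mixer = data_config
    else:
        raise ValueError(f"Data config {data_config!r} not recognized.")

    # Optional arguments fall back to the documented default splits.
    splits = ["train", "test"] if splits is None else splits
    # Assumption: no config names means the default config for every dataset.
    configs = [None] * len(mixer) if not configs else configs

    # The table requires one config name per dataset in the mixer.
    if len(configs) != len(mixer):
        raise ValueError("Number of config names must equal number of datasets.")
    return mixer, splits, configs


mixer, splits, configs = resolve_inputs(
    ToyDataArguments(dataset_mixer={"dataset_a": 1.0, "dataset_b": 0.5})
)
print(splits, configs)  # ['train', 'test'] [None, None]
```

Passing `configs=["only_one"]` with a two-entry mixer raises a `ValueError` in this sketch, which is the length constraint the table describes.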
Outputs
| Name | Type | Description |
|---|---|---|
| raw_datasets | DatasetDict | Dictionary with "train" and "test" splits, each a HuggingFace Dataset |
Usage Examples
Loading SimPO Training Data
from alignment import get_datasets, DataArguments

# Configure the dataset mixer
data_args = DataArguments(
    dataset_mixer={"princeton-nlp/llama3-ultrafeedback": 1.0},
    dataset_splits=["train_prefs", "test_prefs"],
)

# Load and mix datasets
raw_datasets = get_datasets(
    data_args,
    splits=data_args.dataset_splits,
    columns_to_keep=["messages", "chosen", "rejected", "prompt", "completion", "label"],
)

print(raw_datasets)
# DatasetDict({
#     train: Dataset({features: ['chosen', 'rejected', 'prompt', ...], num_rows: ...})
#     test: Dataset({features: ['chosen', 'rejected', 'prompt', ...], num_rows: ...})
# })
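Note that the printed result uses the keys train and test even though the requested splits were train_prefs and test_prefs. One plausible reading, sketched here as an assumption rather than a quote of the package's code, is that each split name is routed to an output key by substring match. `route_splits` is a hypothetical helper used only for this illustration.

```python
from typing import Dict, List


def route_splits(splits: List[str]) -> Dict[str, List[str]]:
    """Group requested split names under the output key they would feed."""
    routed: Dict[str, List[str]] = {"train": [], "test": []}
    for split in splits:
        if "train" in split:
            routed["train"].append(split)
        elif "test" in split:
            routed["test"].append(split)
        else:
            # Assumption: names matching neither key are rejected.
            raise ValueError(f"Split type {split} not recognized as one of test or train.")
    return routed


routed = route_splits(["train_prefs", "test_prefs"])
print(routed)  # {'train': ['train_prefs'], 'test': ['test_prefs']}
```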
Multi-Dataset Mixing
from alignment import get_datasets, DataArguments

# Mix two datasets at different proportions
data_args = DataArguments(
    dataset_mixer={
        "princeton-nlp/llama3-ultrafeedback": 1.0,  # Use 100% of training set
        "another-org/preference-data": 0.5,         # Use 50% of training set
    },
)

# splits is omitted, so the default ["train", "test"] applies
raw_datasets = get_datasets(
    data_args,
    columns_to_keep=["chosen", "rejected", "prompt"],
)
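To make the proportions concrete, the following pure-Python sketch mimics proportional mixing on toy data. It is an approximation under stated assumptions: plain lists stand in for Dataset objects, `mix_train` and the dataset names are hypothetical, the kept row count is assumed to be `int(fraction * size)` taken from the start of each source, and Python's `random` module stands in for the seeded dataset shuffle.

```python
import random
from typing import Dict, List


def mix_train(
    sources: Dict[str, List[dict]],
    mixer: Dict[str, float],
    shuffle: bool = True,
    seed: int = 42,
) -> List[dict]:
    """Take a leading fraction of each source, concatenate, optionally shuffle."""
    if any(frac < 0 for frac in mixer.values()):
        raise ValueError("Dataset fractions cannot be negative.")
    mixed: List[dict] = []
    for name, frac in mixer.items():
        rows = sources[name]
        n_keep = int(frac * len(rows))  # assumption: truncate toward zero
        mixed.extend(rows[:n_keep])
    if shuffle:
        # A fixed seed keeps the mixed order reproducible across runs.
        random.Random(seed).shuffle(mixed)
    return mixed


sources = {
    "dataset_a": [{"prompt": f"a{i}"} for i in range(10)],
    "dataset_b": [{"prompt": f"b{i}"} for i in range(8)],
}
mixed = mix_train(sources, {"dataset_a": 1.0, "dataset_b": 0.5})
print(len(mixed))  # 14: all 10 rows of dataset_a plus 4 of the 8 rows of dataset_b
```

With a proportion of 1.0 and 0.5, the mixed training set here contains 10 + 4 = 14 rows, and rerunning with the same seed yields the same order.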