Implementation: ContextualAI HALOs Dataset Loaders
| Knowledge Sources | Value |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Concrete tooling, provided by the HALOs data module, for loading preference, binary-feedback, and SFT datasets and normalizing them into a common schema.
Description
The train/data.py module defines the Example dataclass and Dataset collection class, along with 12+ get_{name} loader functions that parse datasets from HuggingFace Hub or local JSON files into the common Example schema. Supported datasets include SHP, Anthropic HH, UltraFeedback, OASST, UltraBin, AlpacaEval, SafeRLHF, WildBench, and s1K. Additional loaders handle sampled data (get_sampled_data) and labeled feedback (get_feedback) produced by the online alignment loop.
Usage
Import and call get_{name}(split) to load any supported dataset. The DataLoader classes in train/dataloader.py call these functions internally when initialized with dataset names.
Code Reference
Source Location
- Repository: ContextualAI/HALOs
- File: train/data.py
- Lines: L37-698
Signature
@dataclass
class Example:
    prompt: List = field(default_factory=list)
    prompt_id: int = -1
    generations: List = field(default_factory=list)
    sft_index: int = -1
    scores: List[float] = field(default_factory=list)
    pairs: List[Tuple[int, int]] = field(default_factory=list)
    desirable: List[bool] = field(default_factory=list)
    dataset_name: str = ''
    original_prompt: str = ''

class Dataset:
    def __init__(self, name: str):
        self.name = name
        self.data = defaultdict(Example)

# Representative loader function signature:
def get_shp(split: str, human_prefix: str = 'user', human_suffix: str = '',
            assistant_prefix: str = 'assistant', assistant_suffix: str = '') -> Dataset:
    """Load the Stanford Human Preferences dataset."""
    ...

def get_sampled_data(split: str, ...) -> Dataset:
    """Load data that was sampled using train.sample."""
    ...

def get_feedback(split: str, ...) -> Dataset:
    """Load labeled feedback data (pairwise or binary) from a JSON file."""
    ...
Import
from train.data import Example, Dataset
from train import data as data_module
# Dynamic dispatch: getattr(data_module, f'get_{name}')(split)
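The getattr-based dispatch above can be wrapped in a small helper. The sketch below is self-contained: the `loaders` namespace and `get_toy` function are illustrative stand-ins for `train.data` and its `get_{name}` loaders, and the `load_by_name` helper is a hypothetical name, not part of the repo.

```python
import types

# Stand-in "module" with one loader, mirroring train.data's get_{name} convention.
loaders = types.SimpleNamespace()

def get_toy(split: str):
    """Illustrative loader: returns a (name, split) tuple instead of a real Dataset."""
    return ('toy', split)

loaders.get_toy = get_toy

def load_by_name(name: str, split: str):
    """Resolve get_{name} via getattr, as the HALOs DataLoaders do internally."""
    fn = getattr(loaders, f'get_{name}', None)
    if fn is None:
        raise ValueError(f'No loader found for dataset {name!r}')
    return fn(split)

print(load_by_name('toy', 'train'))  # ('toy', 'train')
```

Against the real module, `loaders` would be `train.data` and the returned object a `Dataset`; the resolution logic is the same.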
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| split | str | Yes | One of 'train' or 'test' |
| dataset name | str | Yes | Resolved via get_{name} dispatch (e.g., 'shp', 'hh', 'ultrabin', 'alpacaeval') |
| JSON file path | str | No | For local data: path used as dataset name, parsed by get_feedback or get_sampled_data |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Collection of Example objects indexed by prompt hash |
| Example.prompt | List[Dict] | Multi-turn conversation with 'role' and 'content' keys |
| Example.generations | List[List[Dict]] | Candidate responses as lists of turns |
| Example.pairs | List[Tuple[int, int]] | Preference pairs (preferred_idx, dispreferred_idx) |
| Example.desirable | List[bool] | Binary labels per generation |
| Example.scores | List[float] | Scalar scores per generation |
| Example.sft_index | int | Index of the SFT target generation |
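The contract above can be exercised with a toy instance. The dataclass below is a trimmed mirror of the Example schema (illustration only, with invented values), showing how `pairs`, `desirable`, and `sft_index` all index into `generations`:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Example:
    # Trimmed mirror of train.data.Example, for illustration only.
    prompt: List[Dict] = field(default_factory=list)
    generations: List[List[Dict]] = field(default_factory=list)
    pairs: List[Tuple[int, int]] = field(default_factory=list)
    desirable: List[bool] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)
    sft_index: int = -1

ex = Example(
    prompt=[{'role': 'user', 'content': 'What is 2+2?'}],
    generations=[
        [{'role': 'assistant', 'content': '4'}],
        [{'role': 'assistant', 'content': '5'}],
    ],
    pairs=[(0, 1)],           # generation 0 preferred over generation 1
    desirable=[True, False],  # binary view of the same judgment
    scores=[1.0, 0.0],
    sft_index=0,              # generation 0 is the SFT target
)

# Unpack a preference pair into (chosen, rejected) response texts.
chosen_idx, rejected_idx = ex.pairs[0]
chosen = ex.generations[chosen_idx][-1]['content']
rejected = ex.generations[rejected_idx][-1]['content']
print(chosen, rejected)  # 4 5
```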
Usage Examples
Loading a HuggingFace Dataset
from train.data import Dataset
from train import data as data_module
# Load Stanford Human Preferences for training
dataset = data_module.get_shp('train')
# Access an example by its prompt hash
for prompt_id, example in dataset.data.items():
    print(f"Prompt: {example.prompt[0]['content'][:100]}...")
    print(f"Num generations: {example.num_generations()}")
    print(f"Preference pairs: {example.pairs}")
    break
Loading Online Feedback Data
from train import data as data_module
# Load pairwise feedback from the online alignment loop
dataset = data_module.get_feedback('train', path='feedback_round1.json')
for prompt_id, example in dataset.data.items():
    print(f"Pairs: {example.pairs}")
    print(f"Desirable: {example.desirable}")
    break
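Supporting a new dataset amounts to writing another get_{name} function that fills the same schema. The self-contained sketch below uses minimal Example/Dataset stand-ins; the `get_local` name and the JSON record layout (`prompt`/`chosen`/`rejected`) are assumptions for illustration, not the repo's actual on-disk format, and the real module keys examples by a hash of the prompt rather than the raw string.

```python
import io
import json
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Example:
    # Minimal stand-in for train.data.Example.
    prompt: List[Dict] = field(default_factory=list)
    generations: List[List[Dict]] = field(default_factory=list)
    pairs: List[Tuple[int, int]] = field(default_factory=list)
    dataset_name: str = ''

class Dataset:
    def __init__(self, name: str):
        self.name = name
        self.data = defaultdict(Example)

def get_local(split: str, fileobj) -> Dataset:
    """Hypothetical loader: parse pairwise records from a JSON file object."""
    records = json.load(fileobj)
    dataset = Dataset('local')
    for rec in records:
        key = rec['prompt']  # the real module keys by a hash of the prompt
        ex = dataset.data[key]
        ex.dataset_name = 'local'
        ex.prompt = [{'role': 'user', 'content': rec['prompt']}]
        for text in (rec['chosen'], rec['rejected']):
            ex.generations.append([{'role': 'assistant', 'content': text}])
        # The preferred generation is appended first, so it gets the lower index.
        ex.pairs.append((len(ex.generations) - 2, len(ex.generations) - 1))
    return dataset

raw = json.dumps([{'prompt': 'hi', 'chosen': 'hello!', 'rejected': 'go away'}])
ds = get_local('train', io.StringIO(raw))
print(ds.data['hi'].pairs)  # [(0, 1)]
```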