Principle: ContextualAI HALOs Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A unified data abstraction that normalizes heterogeneous preference, binary feedback, and SFT datasets into a common schema for alignment training.
Description
LLM alignment requires datasets in several feedback formats: paired preferences (response A is better than response B), binary feedback (this response is desirable/undesirable), and SFT targets (generate this specific response). Different public datasets provide feedback in different formats and structures.
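To make this heterogeneity concrete, here is a sketch of what raw records in the three feedback formats might look like before normalization. The field names (`chosen`, `rejected`, `label`, `target`) are illustrative, not tied to any specific public dataset:

```python
# 1. Paired preference: two responses to one prompt, one marked better.
preference_record = {
    "prompt": "Explain entropy.",
    "chosen": "Entropy measures disorder in a system ...",
    "rejected": "Entropy is a city in ...",
}

# 2. Binary feedback: a single response with a desirable/undesirable label.
binary_record = {
    "prompt": "Explain entropy.",
    "response": "Entropy measures disorder in a system ...",
    "label": True,  # desirable
}

# 3. SFT target: a single gold response the model should imitate.
sft_record = {
    "prompt": "Explain entropy.",
    "target": "Entropy measures disorder in a system ...",
}
```

A loader's job is to map each of these shapes onto the same Example schema described below.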
The data preparation principle addresses this heterogeneity by defining a universal Example dataclass that captures all feedback modalities in a single schema. Each Example contains a multi-turn prompt, a list of candidate generations, paired preference indices, binary desirability labels, scalar scores, and an SFT target index. Dataset-specific loader functions (get_{name}) parse raw data from HuggingFace or local JSON files into this common format.
This normalization enables downstream components (DataLoaders, Trainers) to operate on a single interface regardless of the original data source.
Usage
Use this principle whenever loading training or evaluation data for any HALOs workflow. It is the mandatory first data processing step for SFT training, preference alignment (DPO, KTO, GRPO, etc.), reward model training, and online iterative alignment.
Theoretical Basis
The core abstraction is the Example dataclass:
```
# Abstract schema (not implementation)
Example:
    prompt: List[Turn]             # Multi-turn conversation context
    generations: List[List[Turn]]  # Candidate responses
    pairs: List[Tuple[int, int]]   # (preferred_idx, dispreferred_idx) for paired feedback
    desirable: List[bool]          # Binary feedback per generation
    scores: List[float]            # Scalar scores per generation
    sft_index: int                 # Index of the SFT target in generations
    dataset_name: str              # Source dataset identifier
```
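A minimal runnable sketch of this schema as a Python dataclass. It is simplified relative to the real implementation: `Turn` is reduced to a role/content dict, and the default values are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Turn = Dict[str, str]  # simplified stand-in for one conversation turn


@dataclass
class Example:
    prompt: List[Turn] = field(default_factory=list)
    generations: List[List[Turn]] = field(default_factory=list)
    pairs: List[Tuple[int, int]] = field(default_factory=list)
    desirable: List[bool] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)
    sft_index: int = -1
    dataset_name: str = ""


# One prompt with two candidate generations and paired preference feedback;
# the preferred generation doubles as the SFT target.
ex = Example(
    prompt=[{"role": "user", "content": "Explain entropy."}],
    generations=[
        [{"role": "assistant", "content": "Entropy measures disorder ..."}],
        [{"role": "assistant", "content": "Entropy is a city ..."}],
    ],
    pairs=[(0, 1)],  # generation 0 preferred over generation 1
    sft_index=0,
    dataset_name="example_prefs",
)
```

Note that the unused fields (`desirable`, `scores`) simply stay empty: the single schema carries whichever feedback signals the source dataset actually provides.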
A Dataset is a collection of Examples indexed by prompt hash, ensuring each unique prompt maps to exactly one Example (aggregating all its generations and feedback signals).
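One way to realize prompt-hash indexing is sketched below. The hashing scheme and record layout here are assumptions; the point is that repeated feedback for the same prompt collapses into a single aggregated entry:

```python
import hashlib
import json


def prompt_key(prompt):
    """Deterministic key for a multi-turn prompt (illustrative hashing)."""
    return hashlib.sha256(json.dumps(prompt, sort_keys=True).encode()).hexdigest()


dataset = {}  # prompt hash -> aggregated example record


def add_feedback(prompt, generation, **signals):
    """Merge a generation and its feedback into the unique entry for this
    prompt, creating the entry on first sight."""
    key = prompt_key(prompt)
    entry = dataset.setdefault(key, {"prompt": prompt, "generations": [], "signals": []})
    entry["generations"].append(generation)
    entry["signals"].append(signals)
    return key


# Two feedback records for the same prompt collapse into one entry:
p = [{"role": "user", "content": "Explain entropy."}]
k1 = add_feedback(p, "Entropy measures disorder ...", desirable=True)
k2 = add_feedback(p, "Entropy is a city ...", desirable=False)
```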
The loader dispatch pattern uses Python's naming convention: for any dataset name X, the system calls get_X(split) to obtain a Dataset object.
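The dispatch pattern can be sketched with a name lookup over the module's own functions. The loader `get_shp` below is a hypothetical stand-in; the convention means adding a new dataset only requires defining one `get_{name}` function:

```python
def get_shp(split):
    """Hypothetical loader for a dataset named 'shp'."""
    return {"name": "shp", "split": split, "examples": []}


def load_dataset_by_name(name, split):
    """Resolve get_{name} in the current namespace and call it."""
    loader = globals().get(f"get_{name}")
    if loader is None:
        raise ValueError(f"no loader get_{name} defined")
    return loader(split)


data = load_dataset_by_name("shp", "train")
```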