Principle: OpenBMB UltraFeedback Instruction Sampling
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction, Preference_Learning |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A data curation strategy that aggregates instructions from diverse NLP task sources to create a broad-coverage seed corpus for preference dataset construction.
Description
Instruction Sampling is the first stage of preference dataset construction. It involves loading pre-prepared instruction datasets from multiple heterogeneous sources, each contributing different task types and difficulty levels. In the UltraFeedback pipeline, six instruction sources are used: UltraChat (multi-turn dialogue), ShareGPT (real user conversations), FLAN (academic NLP tasks), Evol-Instruct (complexity-evolved instructions), TruthfulQA (adversarial truthfulness probes), and FalseQA (false-premise questions). The diversity of sources ensures the resulting preference dataset covers a wide spectrum of instruction-following capabilities.
Each source is stored as a JSON file and loaded into a HuggingFace Dataset object for downstream processing. The source identity (subset name) is preserved throughout the pipeline because it determines which principle distribution and world knowledge context apply in later stages.
Usage
Use this principle when constructing preference datasets that require broad instruction coverage. It is the entry point for any pipeline that generates multi-model completions and then annotates them for preference learning. The choice of instruction sources directly impacts the diversity and quality of the final preference pairs.
Theoretical Basis
The theoretical motivation comes from the observation that LLM alignment benefits from training on diverse instruction types. A preference dataset biased toward a single task type (e.g., only chat) produces models that are poorly calibrated on factual, reasoning, or safety tasks.
Pseudo-code Logic:

```
# Abstract algorithm
for each source in [ultrachat, sharegpt, flan, evol_instruct, truthful_qa, false_qa]:
    instructions = load_json(source_path)
    dataset = create_hf_dataset(instructions)
    # Preserve subset identity for downstream principle selection
    yield dataset, source_name
```
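As a concrete, dependency-free sketch of this loop (the file layout `<source>.json` under a data directory and the helper name are assumptions for illustration; the real pipeline wraps each source in a HuggingFace `Dataset` object rather than a plain list):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# The six UltraFeedback instruction sources.
SOURCES = ["ultrachat", "sharegpt", "flan", "evol_instruct", "truthful_qa", "false_qa"]

def iter_instruction_sources(data_dir):
    """Yield (records, source_name) for each instruction source,
    keeping the subset name alongside the data so downstream stages
    can condition on it. Lists of dicts stand in for HuggingFace
    Dataset objects to keep this sketch stdlib-only."""
    for name in SOURCES:
        records = json.loads((Path(data_dir) / f"{name}.json").read_text())
        yield records, name

# Demo with synthetic single-instruction files (contents are made up):
with TemporaryDirectory() as d:
    for name in SOURCES:
        (Path(d) / f"{name}.json").write_text(
            json.dumps([{"instruction": f"example from {name}"}])
        )
    loaded = list(iter_instruction_sources(d))

print([name for _, name in loaded])
```

Yielding the subset name as a separate value, rather than burying it inside each record, keeps the source identity available even when later stages transform or replace the record contents.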
The key design decisions are:
- Source diversity: Six sources spanning dialogue, academic NLP, adversarial probes, and evolved instructions
- Flat loading: All sources are loaded as flat JSON, normalized to a common schema with an `instruction` field
- Subset-aware processing: The subset name propagates through the pipeline to condition principle sampling and world knowledge injection