
Workflow:OpenBMB UltraFeedback Dataset Construction

From Leeroopedia


Knowledge Sources
Domains LLMs, Data_Engineering, Preference_Learning, RLHF
Last Updated 2023-12-29 00:00 GMT

Overview

End-to-end process for constructing a large-scale, fine-grained preference dataset by sampling instructions, generating multi-model completions, and annotating them with GPT-4 across four quality aspects.

Description

This workflow outlines the complete UltraFeedback dataset construction pipeline. Starting from 63,967 instructions sampled across six public datasets (UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, FLAN), it generates four completions per instruction using a diverse pool of 17 LLMs. Each completion is guided by a sampled behavioral principle (helpfulness, honesty, truthfulness, or verbalized calibration). The completions are then annotated in two passes: first a critique pass that produces textual feedback and an overall score (1-10), then a fine-grained preference pass that rates each completion across four aspects (instruction-following, honesty, truthfulness, helpfulness) with detailed rubrics. A final data quality step corrects erroneously scored completions, yielding 256k annotated samples suitable for training reward models and critique models.

Usage

Execute this workflow when you need to construct a preference dataset for RLHF (Reinforcement Learning from Human Feedback) or reward model training. The pipeline requires access to multiple LLM inference endpoints (both open-source and commercial), the OpenAI GPT-4 API for annotation, and source instruction datasets. The output is a structured JSON dataset with multi-aspect annotations suitable for training reward models (like UltraRM) or critique models (like UltraCM).

Execution Steps

Step 1: Instruction Sampling

Collect and sample instructions from six diverse public datasets to ensure broad coverage of instruction types. Each dataset contributes a controlled proportion: all instructions from TruthfulQA and FalseQA, stratified samples from FLAN (3k from CoT, 10 per task from other subsets), and random samples from Evol-Instruct (10k), UltraChat (10k), and ShareGPT (20k). This yields approximately 64k instructions total.

Key considerations:

  • Overly long FLAN instructions are excluded
  • Stratified sampling ensures diversity within FLAN subsets
  • Each instruction records its source dataset for downstream principle assignment
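The per-dataset sampling plan above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `sample_instructions` and the plan layout are assumptions, and loading the source datasets is out of scope.

```python
import random

# Per-dataset sample sizes from the text; None means "take all instructions".
SAMPLING_PLAN = {
    "TruthfulQA": None,
    "FalseQA": None,
    "FLAN-CoT": 3_000,
    "Evol-Instruct": 10_000,
    "UltraChat": 10_000,
    "ShareGPT": 20_000,
}

def sample_instructions(name, instructions, rng=random.Random(0)):
    """Return the sampled subset for one source dataset, tagging each
    instruction with its source for downstream principle assignment."""
    n = SAMPLING_PLAN.get(name)
    if n is None:
        picked = instructions
    else:
        picked = rng.sample(instructions, min(n, len(instructions)))
    return [{"source": name, "instruction": x} for x in picked]
```

Recording the `source` field on every instruction is what allows Step 3 to apply dataset-specific principle distributions later.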

Step 2: Model Sampling

For each instruction, randomly sample 4 models from a pool of 17 diverse LLMs spanning commercial (GPT-4, GPT-3.5 Turbo, Bard), LLaMA-family (LLaMA-2 7B/13B/70B chat, UltraLM-13B/65B, WizardLM 7B/13B/70B, Vicuna-33B, Alpaca-7B), and non-LLaMA series (Falcon-40B-Instruct, MPT-30B-Chat, StarChat-Beta, Pythia-12B). The diversity of architectures, sizes, and training data prevents reward model overfitting to particular text styles.

Key considerations:

  • Models are sampled without replacement per instruction
  • The pool includes different base architectures (LLaMA, Falcon, MPT, StarChat) to maximize stylistic diversity
  • Model assignments are stored alongside each instruction for downstream processing
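Sampling 4 of the 17 models without replacement is a one-liner with `random.sample`; a sketch, with illustrative model labels (the exact identifier strings used by the pipeline are not specified in this page):

```python
import random

# The 17-model pool described above (labels are illustrative).
MODEL_POOL = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "llama-2-7b-chat", "llama-2-13b-chat", "llama-2-70b-chat",
    "ultralm-13b", "ultralm-65b",
    "wizardlm-7b", "wizardlm-13b", "wizardlm-70b",
    "vicuna-33b", "alpaca-7b",
    "falcon-40b-instruct", "mpt-30b-chat", "starchat-beta", "pythia-12b",
]

def assign_models(instructions, k=4, seed=0):
    """Attach k distinct models to each instruction record.
    random.sample draws without replacement, as the text requires."""
    rng = random.Random(seed)
    return [{**inst, "models": rng.sample(MODEL_POOL, k)} for inst in instructions]
```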

Step 3: Principle Sampling

For each completion, sample a behavioral principle from a dataset-specific distribution. The four principles are helpfulness, honesty, truthfulness, and verbalized calibration. Each principle maps to a pool of ~11 system prompt variants (generated by GPT-4); one variant is randomly selected and injected into the model's system prompt to steer generation behavior.

What happens:

  • Evol-Instruct instructions always use the helpfulness principle
  • TruthfulQA/FalseQA instructions use honesty or truthfulness
  • ShareGPT/UltraChat use a 60/20/18/2 mix of helpfulness/truthfulness/honesty/verbalized calibration
  • FLAN uses a 60/20/0/20 mix of helpfulness/truthfulness/honesty/verbalized calibration
  • A concrete system prompt string is randomly chosen from the principle's prompt pool
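The dataset-specific distributions above can be encoded as weighted draws. This sketch assumes an even honesty/truthfulness split for TruthfulQA and FalseQA, since the page does not give the exact ratio; the dictionary layout and function name are likewise illustrative.

```python
import random

# Principle weights (percent) per source dataset, from the text.
PRINCIPLE_WEIGHTS = {
    "Evol-Instruct": {"helpfulness": 100},
    "TruthfulQA": {"honesty": 50, "truthfulness": 50},   # assumed even split
    "FalseQA": {"honesty": 50, "truthfulness": 50},      # assumed even split
    "ShareGPT": {"helpfulness": 60, "truthfulness": 20,
                 "honesty": 18, "verbalized_calibration": 2},
    "UltraChat": {"helpfulness": 60, "truthfulness": 20,
                  "honesty": 18, "verbalized_calibration": 2},
    "FLAN": {"helpfulness": 60, "truthfulness": 20,
             "honesty": 0, "verbalized_calibration": 20},
}

def sample_principle(source, rng=random):
    """Draw one principle for a completion; zero-weight entries are never chosen."""
    weights = PRINCIPLE_WEIGHTS[source]
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

A concrete system prompt would then be drawn uniformly from the chosen principle's pool of ~11 GPT-4-written variants.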

Step 4: Completion Generation

Generate one response per assigned model for each instruction. The pipeline supports two inference backends: HuggingFace Transformers (sequential, per-model) and vLLM (batched, high-throughput). Each model's prompt is formatted according to its architecture-specific conversation template (e.g., LLaMA-2 uses special bracket tokens, Vicuna uses colon-separated turns, MPT uses ChatML format). Commercial models (GPT-4, GPT-3.5) are called via the OpenAI API.

Key considerations:

  • Prompt formatting uses the fastchat conversation template library with templates for 6+ architectures
  • Generation uses temperature=1.0 and top_p=1.0 for diversity
  • Maximum 1024 new tokens per completion
  • Model-specific stopping criteria prevent runaway generation
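The architecture-specific formatting can be illustrated with hand-written templates. In practice the pipeline uses the fastchat conversation-template library; the literal template strings below are simplified approximations of the LLaMA-2, Vicuna, and ChatML formats, shown only to make the differences concrete.

```python
def format_prompt(model_family, system, user):
    """Render a (system, user) pair in a family-specific conversation format.
    Simplified templates; real pipelines should use fastchat's templates."""
    if model_family == "llama-2":
        return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
    if model_family == "vicuna":
        return f"{system} USER: {user} ASSISTANT:"
    if model_family == "chatml":  # e.g. MPT-30B-Chat
        return (f"<|im_start|>system\n{system}<|im_end|>\n"
                f"<|im_start|>user\n{user}<|im_end|>\n"
                f"<|im_start|>assistant\n")
    raise ValueError(f"no template for {model_family}")

# Sampling settings from the text, shared across backends.
GEN_KWARGS = dict(temperature=1.0, top_p=1.0, max_new_tokens=1024)
```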

Step 5: Critique Annotation

Send each completion to GPT-4 for textual critique and an overall quality score (1-10). The critique prompt asks GPT-4 to act as a teacher, providing constructive feedback on helpfulness, truthfulness, honesty, and instruction-following, while avoiding reference answers. The response is parsed to extract the feedback text and the numeric overall score.

Key considerations:

  • GPT-4 is called with temperature=0 for consistency
  • The system prompt for principle-guided completions is included as context (except verbalized calibration prompts, which are truncated)
  • Score parsing handles edge cases like fraction formats (e.g., "8/10")
  • Each completion receives both textual critique and a scalar score
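Extracting the scalar score robustly is the fiddly part of this step. A hypothetical parser sketch, tolerating the "8/10" fraction format mentioned above (the field labels matched here are assumptions, not the pipeline's exact prompt wording):

```python
import re

def parse_overall_score(critique_text):
    """Extract the 1-10 overall score from a GPT-4 critique.
    Handles 'Overall Score: 8', '8/10', and decimal scores; returns None
    on failure so the caller can re-request the annotation."""
    m = re.search(
        r"(?:Overall\s*Score|Rating)\s*[:\-]?\s*(\d+(?:\.\d+)?)(?:\s*/\s*10)?",
        critique_text,
        re.IGNORECASE,
    )
    if not m:
        return None
    score = float(m.group(1))
    return score if 1 <= score <= 10 else None
```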

Step 6: Fine-Grained Preference Annotation

Send batches of 4 completions (for the same instruction) to GPT-4 for per-aspect annotation across four dimensions: instruction-following, honesty, truthfulness, and helpfulness. Each aspect uses a detailed rubric template with specific rating scales (1-5) and structured output formats. Completions are presented in randomized order to mitigate position bias.

Key considerations:

  • Four separate GPT-4 calls per instruction (one per aspect)
  • Completion ordering is randomized to prevent positional bias
  • Truthfulness and helpfulness aspects include world knowledge context (correct/incorrect answers for TruthfulQA, false premise for FalseQA)
  • Response parsing extracts structured fields (Rating, Rationale, Type) using regex patterns
  • Failed parses trigger re-annotation (up to 10 retries)
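The parse-and-retry loop can be sketched as follows. The regex field names (`Rating:`, `Rationale:`) match the structured output format described above, but the exact prompt templates are not reproduced here; `call_gpt4` is a stand-in for the actual API request.

```python
import re

RATING_RE = re.compile(r"Rating:\s*([1-5])")
RATIONALE_RE = re.compile(r"Rationale:\s*(.+)", re.DOTALL)

def parse_aspect_annotation(text):
    """Extract Rating (1-5) and Rationale from one aspect annotation.
    Returns None if either field is missing, signalling a failed parse."""
    rating = RATING_RE.search(text)
    rationale = RATIONALE_RE.search(text)
    if not rating or not rationale:
        return None
    return {"rating": int(rating.group(1)),
            "rationale": rationale.group(1).strip()}

def annotate_with_retries(call_gpt4, max_retries=10):
    """Re-request the annotation until it parses, up to max_retries times."""
    for _ in range(max_retries):
        parsed = parse_aspect_annotation(call_gpt4())
        if parsed is not None:
            return parsed
    raise RuntimeError("annotation failed after retries")
```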

Step 7: Score Validation and Correction

Identify and correct completions with anomalous overall_score=10 values; the original annotation contained 2,628 such cases. Each is validated against its fine-grained aspect ratings: completions with an average fine-grained score <=2 are corrected to overall_score=1, those with an average >4 keep their score of 10, and ambiguous cases (average between 2 and 4) are re-annotated by GPT-4, using the original critique as context, to produce a corrected score.

Key considerations:

  • The fix loads the dataset from HuggingFace Hub
  • Re-annotation uses max_tokens=1 to extract only the score digit
  • The corrected dataset is saved to disk for redistribution
  • This step is idempotent and can be re-run on the published dataset
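The three-way correction rule can be written as a small pure function, which is what makes the step idempotent. A sketch under assumed field names (`overall_score`, `fine_grained_ratings`); `reannotate` stands in for the GPT-4 call with max_tokens=1.

```python
def correct_overall_score(sample, reannotate=None):
    """Apply the validation rule for anomalous overall_score == 10 samples:
    average fine-grained rating <= 2 -> corrected to 1,
    average > 4 -> score of 10 is kept,
    otherwise -> re-annotate with the original critique as context."""
    if sample["overall_score"] != 10:
        return sample["overall_score"]  # not anomalous; leave unchanged
    ratings = sample["fine_grained_ratings"]
    avg = sum(ratings) / len(ratings)
    if avg <= 2:
        return 1
    if avg > 4:
        return 10
    return reannotate(sample)  # ambiguous case: ask GPT-4 for a fresh score
```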

Execution Diagram

GitHub URL

Workflow Repository