
Workflow:OpenBMB UltraFeedback Dataset Construction

From Leeroopedia


Knowledge Sources
Domains LLMs, Data_Engineering, Preference_Learning, RLHF
Last Updated 2023-12-29 00:00 GMT

Overview

End-to-end process for constructing a large-scale, fine-grained preference dataset by sampling instructions, generating multi-model completions, and annotating them with GPT-4 across four quality aspects.

Description

This workflow outlines the complete UltraFeedback dataset construction pipeline. Starting from 63,967 instructions sampled across six public datasets (UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, FLAN), it generates four completions per instruction using a diverse pool of 17 LLMs. Each completion is guided by a sampled behavioral principle (helpfulness, honesty, truthfulness, or verbalized calibration). The completions are then annotated in two passes: first a critique pass that produces textual feedback and an overall score (1-10), then a fine-grained preference pass that rates each completion across four aspects (instruction-following, honesty, truthfulness, helpfulness) with detailed rubrics. A final data quality step corrects erroneously scored completions, yielding 256k annotated samples suitable for training reward models and critique models.

Usage

Execute this workflow when you need to construct a preference dataset for RLHF (Reinforcement Learning from Human Feedback) or reward model training. The pipeline requires access to multiple LLM inference endpoints (both open-source and commercial), the OpenAI GPT-4 API for annotation, and source instruction datasets. The output is a structured JSON dataset with multi-aspect annotations suitable for training reward models (like UltraRM) or critique models (like UltraCM).

Execution Steps

Step 1: Instruction Sampling

Collect and sample instructions from six diverse public datasets to ensure broad coverage of instruction types. Each dataset contributes a controlled proportion: all instructions from TruthfulQA and FalseQA, stratified samples from FLAN (3k from CoT, 10 per task from other subsets), and random samples from Evol-Instruct (10k), UltraChat (10k), and ShareGPT (20k). This yields approximately 64k instructions total.

Key considerations:

  • Overly long FLAN instructions are excluded
  • Stratified sampling ensures diversity within FLAN subsets
  • Each instruction records its source dataset for downstream principle assignment
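The per-dataset sampling plan above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `sample_instructions` and the plan layout are assumptions, and loading the source datasets is out of scope.

```python
import random

# Per-dataset sample sizes from the text; None means "take all instructions".
SAMPLING_PLAN = {
    "TruthfulQA": None,
    "FalseQA": None,
    "FLAN-CoT": 3_000,
    "Evol-Instruct": 10_000,
    "UltraChat": 10_000,
    "ShareGPT": 20_000,
}

def sample_instructions(name, instructions, rng=random.Random(0)):
    """Return the sampled subset for one source dataset, tagging each
    instruction with its source for downstream principle assignment."""
    n = SAMPLING_PLAN.get(name)
    if n is None:
        picked = instructions
    else:
        picked = rng.sample(instructions, min(n, len(instructions)))
    return [{"source": name, "instruction": x} for x in picked]
```

Recording the `source` field on every instruction is what allows Step 3 to apply dataset-specific principle distributions later.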

Step 2: Model Sampling

For each instruction, randomly sample 4 models from a pool of 17 diverse LLMs spanning commercial (GPT-4, GPT-3.5 Turbo, Bard), LLaMA-family (LLaMA-2 7B/13B/70B chat, UltraLM-13B/65B, WizardLM 7B/13B/70B, Vicuna-33B, Alpaca-7B), and non-LLaMA series (Falcon-40B-Instruct, MPT-30B-Chat, StarChat-Beta, Pythia-12B). The diversity of architectures, sizes, and training data prevents reward model overfitting to particular text styles.

Key considerations:

  • Models are sampled without replacement per instruction
  • The pool includes different base architectures (LLaMA, Falcon, MPT, StarChat) to maximize stylistic diversity
  • Model assignments are stored alongside each instruction for downstream processing
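Sampling 4 of the 17 models without replacement is a one-liner with `random.sample`; a sketch, with illustrative model labels (the exact identifier strings used by the pipeline are not specified in this page):

```python
import random

# The 17-model pool described above (labels are illustrative).
MODEL_POOL = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "llama-2-7b-chat", "llama-2-13b-chat", "llama-2-70b-chat",
    "ultralm-13b", "ultralm-65b",
    "wizardlm-7b", "wizardlm-13b", "wizardlm-70b",
    "vicuna-33b", "alpaca-7b",
    "falcon-40b-instruct", "mpt-30b-chat", "starchat-beta", "pythia-12b",
]

def assign_models(instructions, k=4, seed=0):
    """Attach k distinct models to each instruction record.
    random.sample draws without replacement, as the text requires."""
    rng = random.Random(seed)
    return [{**inst, "models": rng.sample(MODEL_POOL, k)} for inst in instructions]
```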

Step 3: Principle Sampling

For each completion, sample a behavioral principle from a dataset-specific distribution. The four principles are helpfulness, honesty, truthfulness, and verbalized calibration. Each principle maps to a pool of ~11 system prompt variants (generated by GPT-4); one variant is randomly selected and injected into the model's system prompt to steer generation behavior.

What happens:

  • Evol-Instruct instructions always use the helpfulness principle
  • TruthfulQA/FalseQA instructions use honesty or truthfulness
  • ShareGPT/UltraChat use a 60/20/18/2 mix of helpfulness/truthfulness/honesty/verbalized calibration
  • FLAN uses a 60/20/0/20 mix of helpfulness/truthfulness/honesty/verbalized calibration
  • A concrete system prompt string is randomly chosen from the principle's prompt pool
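The dataset-specific distributions above can be encoded as weighted draws. This sketch assumes an even honesty/truthfulness split for TruthfulQA and FalseQA, since the page does not give the exact ratio; the dictionary layout and function name are likewise illustrative.

```python
import random

# Principle weights (percent) per source dataset, from the text.
PRINCIPLE_WEIGHTS = {
    "Evol-Instruct": {"helpfulness": 100},
    "TruthfulQA": {"honesty": 50, "truthfulness": 50},   # assumed even split
    "FalseQA": {"honesty": 50, "truthfulness": 50},      # assumed even split
    "ShareGPT": {"helpfulness": 60, "truthfulness": 20,
                 "honesty": 18, "verbalized_calibration": 2},
    "UltraChat": {"helpfulness": 60, "truthfulness": 20,
                  "honesty": 18, "verbalized_calibration": 2},
    "FLAN": {"helpfulness": 60, "truthfulness": 20,
             "honesty": 0, "verbalized_calibration": 20},
}

def sample_principle(source, rng=random):
    """Draw one principle for a completion; zero-weight entries are never chosen."""
    weights = PRINCIPLE_WEIGHTS[source]
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

A concrete system prompt would then be drawn uniformly from the chosen principle's pool of ~11 GPT-4-written variants.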

Step 4: Completion Generation

Generate one response per assigned model for each instruction. The pipeline supports two inference backends: HuggingFace Transformers (sequential, per-model) and vLLM (batched, high-throughput). Each model's prompt is formatted according to its architecture-specific conversation template (e.g., LLaMA-2 uses special bracket tokens, Vicuna uses colon-separated turns, MPT uses ChatML format). Commercial models (GPT-4, GPT-3.5) are called via the OpenAI API.

Key considerations:

  • Prompt formatting uses the fastchat conversation template library with templates for 6+ architectures
  • Generation uses temperature=1.0 and top_p=1.0 for diversity
  • Maximum 1024 new tokens per completion
  • Model-specific stopping criteria prevent runaway generation
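The architecture-specific formatting can be illustrated with hand-written templates. In practice the pipeline uses the fastchat conversation-template library; the literal template strings below are simplified approximations of the LLaMA-2, Vicuna, and ChatML formats, shown only to make the differences concrete.

```python
def format_prompt(model_family, system, user):
    """Render a (system, user) pair in a family-specific conversation format.
    Simplified templates; real pipelines should use fastchat's templates."""
    if model_family == "llama-2":
        return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
    if model_family == "vicuna":
        return f"{system} USER: {user} ASSISTANT:"
    if model_family == "chatml":  # e.g. MPT-30B-Chat
        return (f"<|im_start|>system\n{system}<|im_end|>\n"
                f"<|im_start|>user\n{user}<|im_end|>\n"
                f"<|im_start|>assistant\n")
    raise ValueError(f"no template for {model_family}")

# Sampling settings from the text, shared across backends.
GEN_KWARGS = dict(temperature=1.0, top_p=1.0, max_new_tokens=1024)
```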

Step 5: Critique Annotation

Send each completion to GPT-4 for textual critique and an overall quality score (1-10). The critique prompt asks GPT-4 to act as a teacher, providing constructive feedback on helpfulness, truthfulness, honesty, and instruction-following, while avoiding reference answers. The response is parsed to extract the feedback text and the numeric overall score.

Key considerations:

  • GPT-4 is called with temperature=0 for consistency
  • The system prompt for principle-guided completions is included as context (except verbalized calibration prompts, which are truncated)
  • Score parsing handles edge cases like fraction formats (e.g., "8/10")
  • Each completion receives both textual critique and a scalar score
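Extracting the scalar score robustly is the fiddly part of this step. A hypothetical parser sketch, tolerating the "8/10" fraction format mentioned above (the field labels matched here are assumptions, not the pipeline's exact prompt wording):

```python
import re

def parse_overall_score(critique_text):
    """Extract the 1-10 overall score from a GPT-4 critique.
    Handles 'Overall Score: 8', '8/10', and decimal scores; returns None
    on failure so the caller can re-request the annotation."""
    m = re.search(
        r"(?:Overall\s*Score|Rating)\s*[:\-]?\s*(\d+(?:\.\d+)?)(?:\s*/\s*10)?",
        critique_text,
        re.IGNORECASE,
    )
    if not m:
        return None
    score = float(m.group(1))
    return score if 1 <= score <= 10 else None
```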

Step 6: Fine-Grained Preference Annotation

Send batches of 4 completions (for the same instruction) to GPT-4 for per-aspect annotation across four dimensions: instruction-following, honesty, truthfulness, and helpfulness. Each aspect uses a detailed rubric template with specific rating scales (1-5) and structured output formats. Completions are presented in randomized order to mitigate position bias.

Key considerations:

  • Four separate GPT-4 calls per instruction (one per aspect)
  • Completion ordering is randomized to prevent positional bias
  • Truthfulness and helpfulness aspects include world knowledge context (correct/incorrect answers for TruthfulQA, false premise for FalseQA)
  • Response parsing extracts structured fields (Rating, Rationale, Type) using regex patterns
  • Failed parses trigger re-annotation (up to 10 retries)
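The parse-and-retry loop can be sketched as follows. The regex field names (`Rating:`, `Rationale:`) match the structured output format described above, but the exact prompt templates are not reproduced here; `call_gpt4` is a stand-in for the actual API request.

```python
import re

RATING_RE = re.compile(r"Rating:\s*([1-5])")
RATIONALE_RE = re.compile(r"Rationale:\s*(.+)", re.DOTALL)

def parse_aspect_annotation(text):
    """Extract Rating (1-5) and Rationale from one aspect annotation.
    Returns None if either field is missing, signalling a failed parse."""
    rating = RATING_RE.search(text)
    rationale = RATIONALE_RE.search(text)
    if not rating or not rationale:
        return None
    return {"rating": int(rating.group(1)),
            "rationale": rationale.group(1).strip()}

def annotate_with_retries(call_gpt4, max_retries=10):
    """Re-request the annotation until it parses, up to max_retries times."""
    for _ in range(max_retries):
        parsed = parse_aspect_annotation(call_gpt4())
        if parsed is not None:
            return parsed
    raise RuntimeError("annotation failed after retries")
```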

Step 7: Score Validation and Correction

Identify and correct completions with anomalous overall_score=10 values; the original annotation contained 2,628 such cases. Each is validated against its fine-grained aspect ratings: completions with an average fine-grained score <=2 are corrected to overall_score=1, those with an average >4 keep their score of 10, and ambiguous cases (average between 2 and 4) are re-annotated by GPT-4, using the original critique as context, to produce a corrected score.

Key considerations:

  • The fix loads the dataset from HuggingFace Hub
  • Re-annotation uses max_tokens=1 to extract only the score digit
  • The corrected dataset is saved to disk for redistribution
  • This step is idempotent and can be re-run on the published dataset
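The three-way correction rule can be written as a small pure function, which is what makes the step idempotent. A sketch under assumed field names (`overall_score`, `fine_grained_ratings`); `reannotate` stands in for the GPT-4 call with max_tokens=1.

```python
def correct_overall_score(sample, reannotate=None):
    """Apply the validation rule for anomalous overall_score == 10 samples:
    average fine-grained rating <= 2 -> corrected to 1,
    average > 4 -> score of 10 is kept,
    otherwise -> re-annotate with the original critique as context."""
    if sample["overall_score"] != 10:
        return sample["overall_score"]  # not anomalous; leave unchanged
    ratings = sample["fine_grained_ratings"]
    avg = sum(ratings) / len(ratings)
    if avg <= 2:
        return 1
    if avg > 4:
        return 10
    return reannotate(sample)  # ambiguous case: ask GPT-4 for a fresh score
```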

Execution Diagram

GitHub URL

Workflow Repository