Workflow:OpenBMB UltraFeedback GPT4 Preference Annotation

Knowledge Sources	UltraFeedback, UltraFeedback Paper, HuggingFace Dataset
Domains	LLMs, Data_Annotation, Preference_Learning, RLHF
Last Updated	2023-12-29 00:00 GMT

Overview

End-to-end process for annotating LLM completions using GPT-4, producing both textual critiques with overall scores and fine-grained multi-aspect preference ratings across instruction-following, honesty, truthfulness, and helpfulness dimensions.

Description

This workflow describes the two-pass GPT-4 annotation pipeline that transforms raw LLM completions into a richly annotated preference dataset. The first pass generates free-form textual critiques and overall quality scores (1-10) for each individual completion. The second pass evaluates batches of four completions per instruction across four fine-grained aspects using detailed rubric templates, producing structured ratings (1-5) with rationale for each aspect. A final validation step corrects anomalous scores by cross-referencing overall scores against fine-grained ratings. The pipeline handles six instruction subsets with dataset-specific context injection (world knowledge for TruthfulQA, false premise flags for FalseQA, reference answers for FLAN).

Usage

Execute this workflow after completion generation is finished and you have JSON files containing instructions with their four model-generated completions. You need an OpenAI API key with GPT-4 access. The critique annotation should be run first, followed by the preference annotation. The score validation step can be run independently on the published HuggingFace dataset.

Execution Steps

Step 1: Load Completion Data

Load the JSON files containing instructions and their generated completions from the completion generation phase. Each file corresponds to one instruction subset (sharegpt, flan, evol_instruct, ultrachat, truthful_qa, false_qa). The data is loaded into a HuggingFace Dataset for iteration. Each record contains the instruction text, source dataset, assigned models, and an array of four completions with their model names, principles, and response text.

Key considerations:

Data is loaded from the completion_data directory produced by the generation phase
Each subset is processed independently
Each JSON file is read into a pandas DataFrame, which is then converted to a HuggingFace Dataset
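
The record shape described above can be sketched with the standard library alone. This is an illustration under assumptions, not the repository's loader: the field names (`instruction`, `source`, `models`, `completions`, `model`, `principle`, `response`) and the `load_subset` helper are hypothetical, and the pandas and HuggingFace Dataset conversion is left out.

```python
import json

# Subset names as listed in Step 1.
SUBSETS = ["sharegpt", "flan", "evol_instruct", "ultrachat", "truthful_qa", "false_qa"]

def load_subset(json_text):
    """Parse one subset's JSON text and check each record has four completions."""
    records = json.loads(json_text)
    for record in records:
        completions = record["completions"]
        if len(completions) != 4:
            raise ValueError(f"expected 4 completions, got {len(completions)}")
        for completion in completions:
            # Hypothetical keys; the real files may name these differently.
            for key in ("model", "principle", "response"):
                if key not in completion:
                    raise KeyError(key)
    return records

# A made-up record used only to exercise the check.
example = json.dumps([{
    "instruction": "Name the three primary colors.",
    "source": "sharegpt",
    "models": ["m1", "m2", "m3", "m4"],
    "completions": [
        {"model": f"m{i}", "principle": "helpfulness", "response": "Red, yellow, blue."}
        for i in range(1, 5)
    ],
}])
records = load_subset(example)
```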

Step 2: Critique Annotation

For each completion individually, send the instruction and response to GPT-4 with a structured critique prompt. GPT-4 acts as a teacher providing constructive feedback on helpfulness, truthfulness, honesty, and instruction-following. The response format requires a textual feedback section followed by an "Overall Score: [1-10]" line. The principle's system prompt is included as a "Note:" appended to the instruction to provide behavioral context for evaluation.

What happens:

Each of the four completions per instruction receives a separate GPT-4 call
The critique prompt explicitly prohibits providing reference answers
GPT-4 is called with temperature=0, top_p=0.6, max_tokens=1024
The response is split on "Overall Score: " to extract critique text and numeric score
Score parsing handles fraction formats (e.g., "8/10") by extracting the numerator
Results are appended directly to each completion's metadata
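
A minimal sketch of the parsing rules listed above, assuming the reply contains exactly one "Overall Score: " marker. The sampling parameters come from this page; the helper names and the exact layout of the appended "Note:" line are assumptions, and no API call is made here.

```python
# Sampling parameters stated in the workflow description.
SAMPLING_PARAMS = {"temperature": 0, "top_p": 0.6, "max_tokens": 1024}

def with_principle_note(instruction, principle_prompt):
    """Append the principle's system prompt as a note (exact layout is assumed)."""
    return f"{instruction}\nNote: {principle_prompt}"

def parse_critique(reply):
    """Split a reply into (critique_text, overall_score)."""
    critique, marker, score_text = reply.partition("Overall Score: ")
    if not marker:
        raise ValueError("reply has no 'Overall Score: ' line")
    score_text = score_text.strip()
    # Fraction formats such as "8/10" keep only the numerator.
    if "/" in score_text:
        score_text = score_text.split("/")[0].strip()
    return critique.strip(), float(score_text)

critique, score = parse_critique(
    "Clear and accurate, but too brief.\nOverall Score: 8/10"
)
```

A reply without the marker raises `ValueError`, which a caller could treat as a signal to re-query.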

Step 3: Fine-grained Preference Annotation

For each instruction, present all four completions simultaneously to GPT-4 for comparative evaluation across four aspects: instruction-following, honesty, truthfulness, and helpfulness. Each aspect uses a dedicated rubric template defining rating scales, output format, and evaluation criteria. Completions are presented in a randomized order to mitigate positional bias, and the annotations are mapped back to the original ordering.
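
The shuffle-and-restore idea can be sketched as follows. The function names are hypothetical and the seeded `random.Random` is used only to make the illustration repeatable; the actual pipeline may track the permutation differently.

```python
import random

def shuffle_completions(completions, rng):
    """Return the completions in random order plus the permutation used."""
    order = list(range(len(completions)))
    rng.shuffle(order)
    # shuffled[k] is the completion originally at position order[k].
    shuffled = [completions[i] for i in order]
    return shuffled, order

def restore_order(shuffled_annotations, order):
    """Map annotations given in shuffled order back to original positions."""
    restored = [None] * len(order)
    for shuffled_pos, original_pos in enumerate(order):
        restored[original_pos] = shuffled_annotations[shuffled_pos]
    return restored

rng = random.Random(0)
completions = ["A", "B", "C", "D"]
shuffled, order = shuffle_completions(completions, rng)
# Stand-in for the annotations GPT-4 would return in shuffled order.
annotations = [f"rating for {text}" for text in shuffled]
restored = restore_order(annotations, order)
```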

Key considerations:

Four GPT-4 calls per instruction (one per aspect), each evaluating all four completions together
Instruction-following uses a 1-5 scale measuring alignment with task goal and restrictions
Honesty uses a 1-5 scale (plus N/A for creative tasks) measuring confidence calibration
Truthfulness uses a 1-5 hallucination severity scale with hallucination type classification (factual error, instruction contradiction, self-contradiction)
Helpfulness uses a 1-5 informativeness scale with type classification (clarity, comprehensiveness, conciseness)
Truthfulness and helpfulness templates include world knowledge context when available
Response parsing uses regex patterns specific to each aspect's output format
Failed parses trigger re-annotation with up to 10 retries per aspect
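
Regex parsing with bounded retries might look like the sketch below. The "Rating: / Rationale:" reply layout and the pattern are assumptions, since the real templates define their own per-aspect formats, and `call_model` is a hypothetical callable standing in for the GPT-4 request. The limit is treated here as 10 attempts in total.

```python
import re

# Assumed reply layout: one "Rating: X" line followed by one "Rationale: ..." line.
RATING_PATTERN = re.compile(r"Rating:\s*([1-5]|N/A)\s*\nRationale:\s*(.+)")

def parse_ratings(reply, expected=4):
    """Extract one rating and rationale per completion, or raise ValueError."""
    matches = RATING_PATTERN.findall(reply)
    if len(matches) != expected:
        raise ValueError(f"expected {expected} ratings, found {len(matches)}")
    return [{"Rating": rating, "Rationale": text.strip()} for rating, text in matches]

def annotate_with_retries(call_model, prompt, max_retries=10):
    """Re-query the model until the reply parses or the attempts run out."""
    last_error = None
    for _ in range(max_retries):
        reply = call_model(prompt)
        try:
            return parse_ratings(reply)
        except ValueError as error:
            last_error = error
    raise RuntimeError(f"annotation failed after {max_retries} attempts") from last_error

# A fake model that fails twice, then returns a well-formed reply.
good_reply = "\n\n".join(
    f"Rating: {rating}\nRationale: Example rationale {index}."
    for index, rating in enumerate(["4", "N/A", "2", "5"], start=1)
)
replies = iter(["unparseable", "Rating: 9", good_reply])
ratings = annotate_with_retries(lambda prompt: next(replies), "prompt text")
```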

Step 4: World Knowledge Injection

For instruction subsets with reference information, inject domain-specific context into the annotation templates. TruthfulQA instructions include subsets of correct and incorrect answers. FalseQA instructions include a flag indicating the question is based on a false premise. FLAN instructions include reference correct answers. All other subsets receive a "No additional world knowledge" placeholder.

Key considerations:

World knowledge is only used by the truthfulness and helpfulness templates
The knowledge is injected as a template variable, not as part of the instruction
This context helps GPT-4 make more accurate factual assessments
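
A rough sketch of the per-subset context selection, assuming hypothetical record fields (`correct_answers`, `incorrect_answers`, `reference_answer`) and a made-up template string. Only the subset names and the "No additional world knowledge" placeholder come from this page.

```python
NO_KNOWLEDGE = "No additional world knowledge"

def world_knowledge(record):
    """Pick the context string for a record based on its source subset."""
    source = record.get("source")
    if source == "truthful_qa":
        correct = "; ".join(record["correct_answers"])
        incorrect = "; ".join(record["incorrect_answers"])
        return f"Correct answers: {correct}\nIncorrect answers: {incorrect}"
    if source == "false_qa":
        return "The question is based on a false premise."
    if source == "flan":
        return f"Reference answer: {record['reference_answer']}"
    return NO_KNOWLEDGE

# Hypothetical template: the knowledge fills its own slot, separate from the instruction.
TEMPLATE = "World knowledge: {world_knowledge}\n\nInstruction: {instruction}"

def render(record):
    """Fill the template for one record."""
    return TEMPLATE.format(
        world_knowledge=world_knowledge(record),
        instruction=record["instruction"],
    )

prompt = render({"source": "ultrachat", "instruction": "Summarize the article."})
```

Keeping the knowledge in a separate template variable means the instruction text shown to GPT-4 stays identical to what the evaluated models saw.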

Step 5: Score Validation and Correction

Identify completions with anomalous overall_score=10 by cross-referencing against their fine-grained aspect ratings. Calculate the average of all aspect ratings per completion. Completions with an average fine-grained score <=2 are corrected directly to overall_score=1 (clearly low quality). Completions with an average >4 keep their score of 10 (legitimately high quality). Ambiguous cases (average above 2 and at most 4) are re-annotated by GPT-4 using the original critique text as additional context, with max_tokens=1 so that only the corrected score digit is returned.

What happens:

The dataset is loaded from the HuggingFace Hub (openbmb/UltraFeedback)
2,628 completions with overall_score=10 are evaluated
Three correction categories: remain (average >4), flip to 1 (average <=2), re-annotate (average >2 and <=4)
Re-annotation includes the original critique text to maintain consistency
The corrected dataset is saved to disk for redistribution
Statistics are printed showing how many completions were retained, re-annotated, and flipped
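
The three-way triage can be sketched as below. The thresholds follow this step's description; the `triage` helper is hypothetical, and it assumes any N/A ratings have already been dropped so that only numeric aspect ratings are averaged.

```python
from collections import Counter
from statistics import mean

def triage(aspect_ratings):
    """Classify a completion whose overall_score is 10 by its average aspect rating."""
    average = mean(aspect_ratings)
    if average > 4:
        return "remain"       # score of 10 is kept
    if average <= 2:
        return "flip"         # corrected directly to overall_score=1
    return "re-annotate"      # ambiguous: sent back to GPT-4

# Made-up ratings for illustration (one list of aspect ratings per completion).
examples = [
    [5, 5, 4, 5],  # average 4.75
    [1, 2, 2, 1],  # average 1.5
    [3, 3, 4, 2],  # average 3
    [4, 4, 4, 4],  # average 4, the boundary falls in the ambiguous band
]
stats = Counter(triage(ratings) for ratings in examples)
```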

GitHub URL

Workflow Repository