Workflow:OpenBMB UltraFeedback GPT4 Preference Annotation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Data_Annotation, Preference_Learning, RLHF |
| Last Updated | 2023-12-29 00:00 GMT |
Overview
End-to-end process for annotating LLM completions using GPT-4, producing both textual critiques with overall scores and fine-grained multi-aspect preference ratings across instruction-following, honesty, truthfulness, and helpfulness dimensions.
Description
This workflow describes the two-pass GPT-4 annotation pipeline that transforms raw LLM completions into a richly annotated preference dataset. The first pass generates free-form textual critiques and overall quality scores (1-10) for each individual completion. The second pass evaluates batches of four completions per instruction across four fine-grained aspects using detailed rubric templates, producing structured ratings (1-5) with rationale for each aspect. A final validation step corrects anomalous scores by cross-referencing overall scores against fine-grained ratings. The pipeline handles six instruction subsets with dataset-specific context injection (world knowledge for TruthfulQA, false premise flags for FalseQA, reference answers for FLAN).
Usage
Execute this workflow after completion generation is finished and you have JSON files containing instructions with their four model-generated completions. You need an OpenAI API key with GPT-4 access. The critique annotation should be run first, followed by the preference annotation. The score validation step can be run independently on the published HuggingFace dataset.
Execution Steps
Step 1: Load Completion Data
Load the JSON files containing instructions and their generated completions from the completion generation phase. Each file corresponds to one instruction subset (sharegpt, flan, evol_instruct, ultrachat, truthful_qa, false_qa). The data is loaded into a HuggingFace Dataset for iteration. Each record contains the instruction text, source dataset, assigned models, and an array of four completions with their model names, principles, and response text.
Key considerations:
- Data is loaded from the completion_data directory produced by the generation phase
- Each subset is processed independently
- Each JSON file is read into a pandas DataFrame, which is then converted to a HuggingFace Dataset
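A minimal sketch of the loading step, using only the standard library. The `<subset>.json` file naming, the assumption that each file holds a JSON array of records, and the `completions` key are illustrative guesses based on the description above, not confirmed details of the repository.

```python
import json
from pathlib import Path

# Subset names as listed in Step 1. The "<subset>.json" naming is an assumption.
SUBSETS = ["sharegpt", "flan", "evol_instruct", "ultrachat", "truthful_qa", "false_qa"]


def load_subset(data_dir, subset):
    """Load one subset file and check that every record has four completions."""
    path = Path(data_dir) / f"{subset}.json"
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # assumed to be a JSON array of record objects
    for i, record in enumerate(records):
        n = len(record.get("completions", []))
        if n != 4:
            raise ValueError(f"{path.name}: record {i} has {n} completions, expected 4")
    return records
```

The returned list can then follow the conversion path described above, for example `datasets.Dataset.from_pandas(pandas.DataFrame(records))`.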
Step 2: Critique Annotation
For each completion individually, send the instruction and response to GPT-4 with a structured critique prompt. GPT-4 acts as a teacher providing constructive feedback on helpfulness, truthfulness, honesty, and instruction-following. The response format requires a textual feedback section followed by an "Overall Score: [1-10]" line. The principle's system prompt is included as a "Note:" appended to the instruction to provide behavioral context for evaluation.
What happens:
- Each of the four completions per instruction receives a separate GPT-4 call
- The critique prompt explicitly prohibits providing reference answers
- GPT-4 is called with temperature=0, top_p=0.6, max_tokens=1024
- The response is split on "Overall Score: " to extract critique text and numeric score
- Score parsing handles fraction formats (e.g., "8/10") by extracting the numerator
- Results are appended directly to each completion's metadata
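The score extraction described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's own code: the function name `parse_critique` is hypothetical, the split uses the last occurrence of the marker, and the score is returned as a float so that replies such as "7.5" are not lost.

```python
import re

SCORE_MARKER = "Overall Score: "


def parse_critique(raw):
    """Split a GPT-4 reply into (critique_text, overall_score)."""
    critique, sep, score_part = raw.rpartition(SCORE_MARKER)
    if not sep:
        raise ValueError("reply has no 'Overall Score: ' marker")
    # Fraction formats such as "8/10" keep only the numerator.
    numerator = score_part.strip().split("/")[0].strip()
    match = re.match(r"\d+(?:\.\d+)?", numerator)
    if match is None:
        raise ValueError(f"could not parse a score from {score_part!r}")
    return critique.strip(), float(match.group())
```

For example, `parse_critique("Clear and accurate.\nOverall Score: 8/10")` yields the critique text `"Clear and accurate."` and the score `8.0`.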
Step 3: Fine-Grained Preference Annotation
For each instruction, present all four completions simultaneously to GPT-4 for comparative evaluation across four aspects: instruction-following, honesty, truthfulness, and helpfulness. Each aspect uses a dedicated rubric template defining rating scales, output format, and evaluation criteria. Completions are presented in a randomized order to mitigate positional bias, and the annotations are mapped back to the original ordering.
Key considerations:
- Four GPT-4 calls per instruction (one per aspect), each evaluating all four completions together
- Instruction-following uses a 1-5 scale measuring alignment with task goal and restrictions
- Honesty uses a 1-5 scale (plus N/A for creative tasks) measuring confidence calibration
- Truthfulness uses a 1-5 hallucination severity scale with hallucination type classification (factual error, instruction contradiction, self-contradiction)
- Helpfulness uses a 1-5 informativeness scale with type classification (clarity, comprehensiveness, conciseness)
- Truthfulness and helpfulness templates include world knowledge context when available
- Response parsing uses regex patterns specific to each aspect's output format
- Failed parses trigger re-annotation with up to 10 retries per aspect
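The order randomization and the retry behavior can be sketched as below. All function names are hypothetical, `annotate_once` stands in for a GPT-4 call plus regex parsing that raises `ValueError` on a failed parse, and treating the limit of 10 as the total number of attempts is an assumption.

```python
import random


def shuffle_completions(completions, rng):
    """Return the completions in random order plus the permutation used."""
    order = list(range(len(completions)))
    rng.shuffle(order)
    return [completions[i] for i in order], order


def restore_order(annotations, order):
    """Map annotations given in shuffled order back to the original positions."""
    restored = [None] * len(order)
    for shuffled_pos, original_pos in enumerate(order):
        restored[original_pos] = annotations[shuffled_pos]
    return restored


def annotate_with_retries(annotate_once, max_attempts=10):
    """Call annotate_once until it parses cleanly or the attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return annotate_once()
        except ValueError as err:  # a failed parse is signalled by ValueError
            last_error = err
    raise RuntimeError(f"annotation failed after {max_attempts} attempts") from last_error
```

Keeping the permutation alongside the shuffled list means that the ratings returned for "Text 1" through "Text 4" can be reattached to the right models without relying on the response content.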
Step 4: World Knowledge Injection
For instruction subsets with reference information, inject domain-specific context into the annotation templates. TruthfulQA instructions include subsets of correct and incorrect answers. FalseQA instructions include a flag indicating the question is based on a false premise. FLAN instructions include reference correct answers. All other subsets receive a "No additional world knowledge" placeholder.
Key considerations:
- World knowledge is only used by the truthfulness and helpfulness templates
- The knowledge is injected as a template variable, not as part of the instruction
- This context helps GPT-4 make more accurate factual assessments
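One way to picture the injection is a small function that picks the context string per subset. The record keys (`source`, `correct_answers`, `incorrect_answers`), the exact wording of each string other than the placeholder, and the `{world_knowledge}` template variable are all hypothetical and chosen for illustration.

```python
NO_KNOWLEDGE = "No additional world knowledge"


def build_world_knowledge(record):
    """Return the context string injected into the rubric template."""
    source = record.get("source", "")
    if source == "truthful_qa":
        correct = "; ".join(record.get("correct_answers", []))
        incorrect = "; ".join(record.get("incorrect_answers", []))
        return f"Correct answers: {correct}\nIncorrect answers: {incorrect}"
    if source == "false_qa":
        return "The question is based on a false premise."
    if source == "flan":
        return "Reference answer: " + "; ".join(record.get("correct_answers", []))
    return NO_KNOWLEDGE


# Hypothetical template fragment: the knowledge is a separate variable,
# so the instruction text itself is left untouched.
TEMPLATE = "Instruction: {instruction}\nWorld knowledge: {world_knowledge}"


def render(record):
    return TEMPLATE.format(
        instruction=record["instruction"],
        world_knowledge=build_world_knowledge(record),
    )
```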
Step 5: Score Validation and Correction
Identify completions with anomalous overall_score=10 by cross-referencing against their fine-grained aspect ratings. Calculate the average of all aspect ratings per completion. Completions whose average fine-grained rating is <=2 are corrected directly to overall_score=1 (clearly low quality). Completions whose average is >4 keep their score of 10 (legitimately high quality). Ambiguous cases (average above 2 and at most 4) are re-annotated by GPT-4 using the original critique text as additional context, with max_tokens=1 so that the reply contains only the corrected score.
What happens:
- The dataset is loaded from the HuggingFace Hub (openbmb/UltraFeedback)
- 2,628 completions with overall_score=10 are evaluated
- Three correction categories by average fine-grained rating: remain (>4), flip to 1 (<=2), re-annotate (>2 and <=4)
- Re-annotation includes the original critique text to maintain consistency
- The corrected dataset is saved to disk for redistribution
- Statistics are printed showing how many completions remained unchanged, were re-annotated, or were flipped
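The three-way triage can be sketched as follows, using the thresholds stated above. Skipping "N/A" ratings when averaging and sending a completion with no numeric ratings to re-annotation are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def average_rating(ratings):
    """Average the numeric aspect ratings, skipping 'N/A' entries (an assumption)."""
    numeric = [float(r) for r in ratings if str(r).strip().upper() != "N/A"]
    return sum(numeric) / len(numeric) if numeric else None


def triage(avg):
    """Classify a completion whose overall_score is 10."""
    if avg is None:
        return "re-annotate"  # no usable ratings: assumed to need a second look
    if avg <= 2:
        return "flip"         # corrected directly to overall_score=1
    if avg > 4:
        return "remain"       # score of 10 is confirmed
    return "re-annotate"      # average above 2 and at most 4


def summarize(rating_lists):
    """Count how many completions fall into each correction category."""
    return Counter(triage(average_rating(r)) for r in rating_lists)
```

Because the boundaries are `<=2` and `>4`, an average of exactly 2 is flipped and an average of exactly 4 is re-annotated.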