Implementation:OpenBMB UltraFeedback Score Correction Pipeline
| Knowledge Sources | Details |
|---|---|
| Domains | NLP, Data_Quality |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A concrete tool for detecting and correcting anomalous overall_score=10 values in the UltraFeedback dataset by cross-referencing each completion's fine-grained aspect ratings.
Description
The fix_overall_score_issue.py module provides four core functions:
calculate_average_rating(annotations): Extracts numeric 'Rating' values from all aspect annotations, filters out "N/A" entries, and computes the mean.
check_score(completion): Applies a three-way triage based on the fine-grained average: returns 2 (flip to 1) if the average is ≤ 2, returns 1 (re-annotate) if it falls in (2, 4], and returns 0 (keep as 10) if it is > 4.
get_eval(model, sys_prompt, user_prompt): Calls GPT-4 with max_tokens=1 for single-digit score re-annotation. Uses the same retry logic as other annotation scripts.
process_completions(example): Main processing function that iterates over completions, computes fine-grained scores, triages score=10 entries, and applies corrections (flip or re-annotate). Tracks global statistics in count_global.
Usage
Run as a standalone script that loads the published openbmb/UltraFeedback dataset from HuggingFace, processes all completions, and saves the corrected dataset to disk.
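Assuming the script takes no command-line flags (an assumption; check the repository for the actual entry point), a typical run from the repository root looks like:
python src/data_annotation/fix_overall_score_issue.py
The corrected dataset is written to ./UltraFeedback via save_to_disk.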
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/data_annotation/fix_overall_score_issue.py (Lines 38-115)
Signature
def calculate_average_rating(annotations: Dict[str, Any]) -> Optional[float]:
"""Computes mean of numeric Rating values across all aspects.
Args:
annotations: Dict mapping aspect names to annotation dicts with 'Rating' key
Returns:
Mean rating as float, or None if no valid ratings found
"""
ratings = [int(aspect['Rating']) for aspect in annotations.values()
if 'Rating' in aspect and aspect['Rating'] != "N/A"]
return sum(ratings) / len(ratings) if ratings else None
def check_score(completion: Dict) -> int:
"""Triages a completion with overall_score=10.
Args:
completion: Dict with 'fine-grained_score' field
Returns:
0 = keep as 10, 1 = re-annotate, 2 = flip to 1
"""
if completion["fine-grained_score"] <= 2:
return 2 # flip
elif completion["fine-grained_score"] <= 4:
return 1 # re-annotate
else:
return 0 # remain
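Both thresholds are inclusive, which is easy to misread; the following worked cases (illustrative, implied directly by the comparisons above) make the boundaries explicit:
check_score({"fine-grained_score": 2.0})   # Returns 2: flip to 1 (avg <= 2)
check_score({"fine-grained_score": 2.25})  # Returns 1: re-annotate (2 < avg <= 4)
check_score({"fine-grained_score": 4.0})   # Returns 1: re-annotate
check_score({"fine-grained_score": 4.25})  # Returns 0: keep as 10 (avg > 4)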
def process_completions(example: Dict) -> Dict:
"""Processes all completions in an example, correcting score=10 anomalies.
Args:
example: Dict with 'instruction' and 'completions' fields
Returns:
example: Same dict with corrected overall_score values
Side effects:
Updates global count_global dict tracking {0: kept, 1: re-annotated, 2: flipped}
"""
...
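The body is elided above; the following is a minimal sketch reconstructing the documented flow, not the actual source. SYS_PROMPT and build_feedback_prompt are hypothetical stand-ins for the script's real prompt construction:
def process_completions(example: Dict) -> Dict:
    for completion in example["completions"]:
        # Compute and store the fine-grained average for every completion
        avg = calculate_average_rating(completion["annotations"])
        if avg is None:
            continue
        completion["fine-grained_score"] = avg
        # Only overall_score=10 entries are triaged
        if completion["overall_score"] == 10:
            flag = check_score(completion)
            count_global[flag] += 1  # {0: kept, 1: re-annotated, 2: flipped}
            if flag == 2:
                completion["overall_score"] = 1  # flip 10 -> 1
            elif flag == 1:
                # Re-annotate with GPT-4; prompt helpers are hypothetical
                new_score = get_eval("gpt-4-0613", SYS_PROMPT,
                                     build_feedback_prompt(example, completion))
                completion["overall_score"] = int(new_score)
    return example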
def get_eval(model: str, sys_prompt: str, user_prompt: str) -> str:
"""Calls GPT-4 with max_tokens=1 for single-digit re-annotation.
Args:
model: Model name (e.g., "gpt-4-0613")
sys_prompt: System prompt
user_prompt: Feedback prompt including original critique
Returns:
Single character/digit response from GPT-4
"""
...
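A sketch of the call using the pre-1.0 openai SDK that was current when this dataset was built; the shared retry logic in the real annotation scripts may use different backoff and error handling:
import time

def get_eval(model: str, sys_prompt: str, user_prompt: str) -> str:
    # Sketch only; mirrors the documented contract (max_tokens=1, retries)
    for attempt in range(5):
        try:
            response = openai.ChatCompletion.create(
                model=model,
                messages=[
                    {"role": "system", "content": sys_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=0,
                max_tokens=1,  # force a single-digit score
            )
            return response["choices"][0]["message"]["content"].strip()
        except openai.error.OpenAIError:
            time.sleep(2 ** attempt)  # exponential backoff (assumed)
    raise RuntimeError("GPT-4 re-annotation failed after retries")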
Import
from typing import List, Dict, Optional, Any
from datasets import load_dataset
import openai
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | datasets.Dataset | Yes | Published UltraFeedback dataset from HuggingFace ("openbmb/UltraFeedback") |
| example["completions"][i]["overall_score"] | Union[int, float] | Yes | Original overall score (targeting score=10 entries) |
| example["completions"][i]["annotations"] | Dict | Yes | Fine-grained aspect annotations with Rating fields |
| example["completions"][i]["critique"] | str | Yes | Original critique text (used in re-annotation prompt) |
Outputs
| Name | Type | Description |
|---|---|---|
| corrected dataset | datasets.Dataset | Dataset with corrected overall_score values, saved to disk via save_to_disk("UltraFeedback") |
| count_global | Dict[int, int] | Statistics: {0: kept_count, 1: re_annotated_count, 2: flipped_count} |
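For orientation, an illustrative (invented) shape of one completions entry, carrying the three required fields from the table above:
completion = {
    "overall_score": 10,
    "critique": "...",
    "annotations": {
        "helpfulness": {"Rating": "3", "Rationale": "..."},
        "honesty": {"Rating": "N/A", "Rationale": "..."},
    },
}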
Usage Examples
Full Correction Pipeline
from datasets import load_dataset
# process_completions and count_global are defined in
# fix_overall_score_issue.py; this snippet mirrors the script's main
# body, where both are in scope.
# Load published dataset
dataset = load_dataset("openbmb/UltraFeedback")["train"]
# Track correction statistics
count_global = {0: 0, 1: 0, 2: 0}
# Process all examples (cache disabled so the counting side effect always runs)
dataset = dataset.map(process_completions, load_from_cache_file=False)
# Report results
print(count_global)
# Example: {0: 1847, 1: 412, 2: 369}
# "2628 completions with score=10: 1847 kept, 412 re-annotated, 369 flipped"
# Save corrected dataset
dataset.save_to_disk("UltraFeedback")
Individual Functions
# Calculate fine-grained average
annotations = {
"instruction_following": {"Rating": "4", "Rationale": "..."},
"honesty": {"Rating": "3", "Rationale": "..."},
"truthfulness": {"Rating": "2", "Rationale": "..."},
"helpfulness": {"Rating": "3", "Rationale": "..."},
}
avg = calculate_average_rating(annotations) # Returns 3.0
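# Edge case (illustrative, not from the source): if every aspect's
# Rating is "N/A", the function returns None rather than 0
calculate_average_rating({"honesty": {"Rating": "N/A"}})  # Returns None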
# Triage a completion
completion = {"fine-grained_score": 3.0, "overall_score": 10}
flag = check_score(completion) # Returns 1 (re-annotate)
Related Pages
Implements Principle
Requires Environment
- Environment:OpenBMB_UltraFeedback_OpenAI_API_Environment
- Environment:OpenBMB_UltraFeedback_HuggingFace_Hub_Environment