Implementation:OpenBMB UltraFeedback Score Correction Pipeline

From Leeroopedia


Knowledge Sources
Domains NLP, Data_Quality
Last Updated 2023-10-02 00:00 GMT

Overview

A concrete tool for detecting and correcting anomalous overall_score=10 values in the UltraFeedback dataset by cross-referencing the fine-grained aspect ratings.

Description

The fix_overall_score_issue.py module provides four core functions:

calculate_average_rating(annotations): Extracts numeric 'Rating' values from all aspect annotations, filters out "N/A" entries, and computes their mean.

check_score(completion): Applies a three-way triage based on the fine-grained average: returns 2 (flip to 1) if the average is ≤ 2, returns 1 (re-annotate) if it is > 2 and ≤ 4, and returns 0 (keep) if it is > 4.

get_eval(model, sys_prompt, user_prompt): Calls GPT-4 with max_tokens=1 for single-digit score re-annotation. Uses the same retry logic as other annotation scripts.

process_completions(example): Main processing function that iterates over completions, computes fine-grained scores, triages score=10 entries, and applies corrections (flip or re-annotate). Tracks global statistics in count_global.
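The retry logic referenced above is not reproduced on this page. As a minimal sketch of what such a backoff wrapper around the GPT-4 call might look like (the helper name with_retries and its parameters are illustrative assumptions, not names from the source):

```python
import time

def with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` with exponential backoff.

    Illustrative sketch: the names and parameters here are assumptions,
    not taken from fix_overall_score_issue.py.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Hypothetical usage around the legacy (v0.x) OpenAI chat API:
# response = with_retries(lambda: openai.ChatCompletion.create(
#     model="gpt-4-0613",
#     messages=[{"role": "system", "content": sys_prompt},
#               {"role": "user", "content": user_prompt}],
#     max_tokens=1,
# ))
```

Injecting the sleep function makes the backoff schedule testable without real delays.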

Usage

Run as a standalone script that loads the published openbmb/UltraFeedback dataset from HuggingFace, processes all completions, and saves the corrected dataset to disk.

Code Reference

Source Location

  • Repository: UltraFeedback
  • File: src/data_annotation/fix_overall_score_issue.py (Lines 38-115)

Signature

def calculate_average_rating(annotations: Dict[str, Any]) -> Optional[float]:
    """Computes mean of numeric Rating values across all aspects.
    Args:
        annotations: Dict mapping aspect names to annotation dicts with 'Rating' key
    Returns:
        Mean rating as float, or None if no valid ratings found
    """
    ratings = [int(aspect['Rating']) for aspect in annotations.values()
               if 'Rating' in aspect and aspect['Rating'] != "N/A"]
    return sum(ratings) / len(ratings) if ratings else None

def check_score(completion: Dict) -> int:
    """Triages a completion with overall_score=10.
    Args:
        completion: Dict with 'fine-grained_score' field
    Returns:
        0 = keep as 10, 1 = re-annotate, 2 = flip to 1
    """
    if completion["fine-grained_score"] <= 2:
        return 2  # flip
    elif completion["fine-grained_score"] <= 4:
        return 1  # re-annotate
    else:
        return 0  # remain

def process_completions(example: Dict) -> Dict:
    """Processes all completions in an example, correcting score=10 anomalies.
    Args:
        example: Dict with 'instruction' and 'completions' fields
    Returns:
        example: Same dict with corrected overall_score values
    Side effects:
        Updates global count_global dict tracking {0: kept, 1: re-annotated, 2: flipped}
    """
    ...

def get_eval(model: str, sys_prompt: str, user_prompt: str) -> str:
    """Calls GPT-4 with max_tokens=1 for single-digit re-annotation.
    Args:
        model: Model name (e.g., "gpt-4-0613")
        sys_prompt: System prompt
        user_prompt: Feedback prompt including original critique
    Returns:
        Single character/digit response from GPT-4
    """
    ...

Import

from typing import List, Dict, Optional, Any
from datasets import load_dataset
import openai

I/O Contract

Inputs

Name | Type | Required | Description
dataset | datasets.Dataset | Yes | Published UltraFeedback dataset from HuggingFace ("openbmb/UltraFeedback")
example["completions"][i]["overall_score"] | Union[int, float] | Yes | Original overall score (targeting score=10 entries)
example["completions"][i]["annotations"] | Dict | Yes | Fine-grained aspect annotations with Rating fields
example["completions"][i]["critique"] | str | Yes | Original critique text (used in the re-annotation prompt)

Outputs

Name | Type | Description
corrected dataset | datasets.Dataset | Dataset with corrected overall_score values, saved to disk via save_to_disk("UltraFeedback")
count_global | Dict[int, int] | Statistics: {0: kept_count, 1: re_annotated_count, 2: flipped_count}

Usage Examples

Full Correction Pipeline

from datasets import load_dataset

# Load published dataset
dataset = load_dataset("openbmb/UltraFeedback")["train"]

# Track correction statistics
count_global = {0: 0, 1: 0, 2: 0}

# Process all examples
dataset = dataset.map(process_completions, load_from_cache_file=False)

# Report results
print(count_global)
# Example: {0: 1847, 1: 412, 2: 369}
# "2628 completions with score=10: 1847 kept, 412 re-annotated, 369 flipped"

# Save corrected dataset
dataset.save_to_disk("UltraFeedback")

Individual Functions

# Calculate fine-grained average
annotations = {
    "instruction_following": {"Rating": "4", "Rationale": "..."},
    "honesty": {"Rating": "3", "Rationale": "..."},
    "truthfulness": {"Rating": "2", "Rationale": "..."},
    "helpfulness": {"Rating": "3", "Rationale": "..."},
}
avg = calculate_average_rating(annotations)  # Returns 3.0
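The example above uses all-numeric ratings; the function's "N/A" filtering can be seen with mixed annotations (the definition from the Code Reference is repeated so the snippet is self-contained):

```python
def calculate_average_rating(annotations):
    # Same logic as in fix_overall_score_issue.py: keep only numeric ratings
    ratings = [int(aspect["Rating"]) for aspect in annotations.values()
               if "Rating" in aspect and aspect["Rating"] != "N/A"]
    return sum(ratings) / len(ratings) if ratings else None

mixed = {
    "honesty": {"Rating": "4", "Rationale": "..."},
    "truthfulness": {"Rating": "N/A", "Rationale": "..."},
}
calculate_average_rating(mixed)  # "N/A" ignored -> returns 4.0

all_na = {"honesty": {"Rating": "N/A", "Rationale": "..."}}
calculate_average_rating(all_na)  # no valid ratings -> returns None
```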

# Triage a completion
completion = {"fine-grained_score": 3.0, "overall_score": 10}
flag = check_score(completion)  # Returns 1 (re-annotate)
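The body of process_completions is elided in the Signature section. The following is a minimal sketch of how it might combine the pieces, under the assumption that the GPT-4 re-annotation step (get_eval) is abstracted behind a hypothetical reannotate callback, which is not part of the original signature:

```python
count_global = {0: 0, 1: 0, 2: 0}  # kept / re-annotated / flipped

def calculate_average_rating(annotations):
    ratings = [int(a["Rating"]) for a in annotations.values()
               if "Rating" in a and a["Rating"] != "N/A"]
    return sum(ratings) / len(ratings) if ratings else None

def check_score(completion):
    if completion["fine-grained_score"] <= 2:
        return 2  # flip
    elif completion["fine-grained_score"] <= 4:
        return 1  # re-annotate
    return 0      # keep

def process_completions(example, reannotate=None):
    """Sketch of the elided body. `reannotate` is a hypothetical stand-in
    for the GPT-4 call (get_eval), not part of the original signature."""
    for completion in example["completions"]:
        avg = calculate_average_rating(completion["annotations"])
        completion["fine-grained_score"] = avg
        if completion["overall_score"] != 10 or avg is None:
            continue  # only score=10 entries with valid ratings are triaged
        flag = check_score(completion)
        count_global[flag] += 1
        if flag == 2:
            completion["overall_score"] = 1           # flip 10 -> 1
        elif flag == 1 and reannotate is not None:
            completion["overall_score"] = int(reannotate(completion))
    return example
```

Routing corrections through a single triage flag keeps the global statistics in count_global consistent with the applied edits.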

Related Pages

  • Implements Principle
  • Requires Environment
  • Uses Heuristic
