
Principle:OpenBMB UltraFeedback Score Validation and Correction

From Leeroopedia


Knowledge Sources
Domains: NLP, Data_Quality, Preference_Learning
Last Updated: 2023-10-02 00:00 GMT

Overview

A post-hoc data quality correction strategy that identifies and remediates anomalous overall_score=10 ratings by cross-referencing with fine-grained aspect ratings.

Description

Score Validation and Correction addresses a specific data quality issue discovered in the UltraFeedback dataset: 2,628 completions received an overall_score of 10, an anomalous value given that the score-parsing logic expected single-digit scores (likely caused by GPT-4 outputting "10" where the parser expected a single digit, or by other format inconsistencies in its responses).
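To make the anomaly concrete, the affected rows can be isolated with a simple filter; the record layout below is an assumption for illustration, not the dataset's exact schema.

```python
# Hypothetical sketch: collect completions carrying the anomalous
# overall_score of 10. The dict layout is assumed for illustration.
def find_anomalous(completions):
    return [c for c in completions if c["overall_score"] == 10]

toy = [
    {"id": "a", "overall_score": 7},
    {"id": "b", "overall_score": 10},
    {"id": "c", "overall_score": 10},
]
flagged = find_anomalous(toy)
print(len(flagged))  # → 2
```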

The correction strategy uses the fine-grained aspect ratings (from the preference annotation pass) as a cross-reference signal. For each completion with overall_score=10, the pipeline:

  1. Calculates the average fine-grained rating across all annotated aspects (instruction_following, honesty, truthfulness, helpfulness)
  2. Applies a three-way triage based on that average:
    • Average ≤ 2 → Flip the score to 1 (clearly bad completion; the 10 was erroneous)
    • 2 < average ≤ 4 → Re-annotate via GPT-4 (ambiguous case; ask GPT-4 for a single-digit score)
    • Average > 4 → Keep the 10 (legitimately good completion)

The re-annotation uses a modified feedback prompt that includes the original critique and asks GPT-4 for just a single-token score (max_tokens=1), providing a more reliable assessment.
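A minimal sketch of how such a re-annotation request could be assembled, assuming the OpenAI chat-completions request format; the prompt wording and the helper name are illustrative assumptions, since the source only specifies that the original critique is included and that max_tokens is capped at 1:

```python
# Hedged sketch: build (but do not send) a re-annotation request. The
# prompt text below is an assumption; only "include the original critique"
# and "max_tokens=1" come from the described strategy.
def build_reannotation_request(instruction, completion_text, critique,
                               model="gpt-4"):
    prompt = (
        "Given the instruction, the completion, and your earlier critique, "
        "rate the completion with a single digit from 1 to 9.\n\n"
        f"Instruction: {instruction}\n"
        f"Completion: {completion_text}\n"
        f"Critique: {critique}\n"
        "Score:"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1,   # force a single-token, single-digit answer
        "temperature": 0,  # deterministic re-annotation (assumption)
    }
```

The returned dict can be passed to a chat-completions client; capping the response at one token prevents GPT-4 from replying "10" again.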

Usage

Use this principle when you have multi-signal annotation data and discover systematic scoring anomalies. The cross-referencing approach leverages independent annotation dimensions to validate questionable holistic scores.
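Putting the triage together, a corrective pass over a toy dataset might look like the following; field names and rating values are assumptions for illustration, while the thresholds follow the rules described above:

```python
from statistics import mean

# Illustrative end-to-end pass over a toy dataset. The "ratings" field
# stands in for the four fine-grained aspect scores; its layout is assumed.
def correct_scores(completions):
    counts = {"flip_to_1": 0, "re_annotate": 0, "keep_as_10": 0}
    for c in completions:
        if c["overall_score"] != 10:
            continue  # only score-10 completions are triaged
        avg = mean(c["ratings"].values())
        if avg <= 2:
            c["overall_score"] = 1       # clearly erroneous: flip to 1
            counts["flip_to_1"] += 1
        elif avg <= 4:
            counts["re_annotate"] += 1   # ambiguous: queue for GPT-4 re-annotation
        else:
            counts["keep_as_10"] += 1    # legitimately good: keep the 10
    return counts

toy = [
    {"overall_score": 10,
     "ratings": {"instruction_following": 1, "honesty": 2,
                 "truthfulness": 1, "helpfulness": 2}},   # avg 1.5 → flip
    {"overall_score": 10,
     "ratings": {"instruction_following": 3, "honesty": 4,
                 "truthfulness": 3, "helpfulness": 3}},   # avg 3.25 → re-annotate
    {"overall_score": 10,
     "ratings": {"instruction_following": 5, "honesty": 5,
                 "truthfulness": 4, "helpfulness": 5}},   # avg 4.75 → keep
    {"overall_score": 8, "ratings": {}},                  # untouched
]
summary = correct_scores(toy)
print(summary)  # → {'flip_to_1': 1, 're_annotate': 1, 'keep_as_10': 1}
```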

Theoretical Basis

The correction leverages the consistency assumption: if multiple independent evaluation dimensions rate a completion poorly (average ≤ 2), the holistic score of 10 is almost certainly erroneous. Conversely, if the fine-grained ratings are high (> 4), the score of 10 may reflect genuinely excellent completions.

The ambiguous middle range (2-4) requires re-evaluation because the fine-grained signal is mixed.

Decision Logic:

# Abstract algorithm
from statistics import mean

def triage_score_10(completion):
    # Average the fine-grained aspect ratings for this completion
    avg_rating = mean(completion.fine_grained_ratings)
    if avg_rating <= 2:
        return "flip_to_1"      # Clearly erroneous: every aspect rates it poorly
    elif avg_rating <= 4:
        return "re_annotate"    # Ambiguous: ask GPT-4 again with max_tokens=1
    else:
        return "keep_as_10"     # Legitimately good completion

Related Pages

Implemented By

Uses Heuristic
