Principle: OpenBMB UltraFeedback Score Validation and Correction
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Quality, Preference_Learning |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A post-hoc data quality correction strategy that identifies and remediates anomalous overall_score=10 ratings by cross-referencing with fine-grained aspect ratings.
Description
Score Validation and Correction addresses a specific data quality issue discovered in the UltraFeedback dataset: 2,628 completions received an overall_score of 10, which falls outside the documented single-digit score range (likely caused by GPT-4 outputting "10" when the score parsing logic expected a single digit, or by other format inconsistencies).
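A minimal sketch of the hypothesized failure mode (the actual parsing code is not shown here, so both functions are illustrative assumptions): a parser that accepts any integer lets an off-format "10" through, while a single-digit range check would reject it.

```python
def parse_overall_score_lenient(reply: str) -> int:
    # Hypothetical lenient parser: trusts int(), so an off-format
    # reply of "10" is stored as a score of 10.
    return int(reply.strip())

def parse_overall_score_strict(reply: str):
    # Hypothetical strict variant: accepts only a single digit 1-9,
    # returning None for out-of-format replies like "10".
    reply = reply.strip()
    if len(reply) == 1 and reply.isdigit() and reply != "0":
        return int(reply)
    return None
```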
The correction strategy uses the fine-grained aspect ratings (from the preference annotation pass) as a cross-reference signal. For each completion with overall_score=10, the pipeline:
- Calculates the average fine-grained rating across all annotated aspects (instruction_following, honesty, truthfulness, helpfulness)
- Applies a three-way triage based on the average:
- Fine-grained average ≤ 2 → Flip to 1 (clearly bad completion; the score of 10 was erroneous)
- Fine-grained average in (2, 4] → Re-annotate via GPT-4 (ambiguous case; ask GPT-4 for a single-digit score)
- Fine-grained average > 4 → Keep as 10 (legitimately good completion)
The re-annotation uses a modified feedback prompt that includes the original critique and asks GPT-4 for just a single-token score (max_tokens=1), providing a more reliable assessment.
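A hedged sketch of this re-annotation step (the prompt wording, function names, and fields below are assumptions for illustration, not the pipeline's exact code):

```python
def build_reannotation_prompt(instruction: str, completion: str, critique: str) -> str:
    # Modified feedback prompt: includes the original critique and
    # constrains the reply to one digit (decoded with max_tokens=1).
    return (
        f"Instruction: {instruction}\n"
        f"Completion: {completion}\n"
        f"Original critique: {critique}\n"
        "Given the critique above, rate the overall quality of the "
        "completion with a single digit from 1 to 9. Reply with the digit only."
    )

def parse_single_token_score(reply: str):
    # With max_tokens=1 the reply should be exactly one digit;
    # anything else is left unresolved rather than guessed.
    reply = reply.strip()
    if len(reply) == 1 and reply.isdigit() and reply != "0":
        return int(reply)
    return None
```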
Usage
Use this principle when you have multi-signal annotation data and discover systematic scoring anomalies. The cross-referencing approach leverages independent annotation dimensions to validate questionable holistic scores.
Theoretical Basis
The correction leverages the consistency assumption: if multiple independent evaluation dimensions rate a completion poorly (average ≤ 2), the holistic score of 10 is almost certainly erroneous. Conversely, if the fine-grained ratings are high (> 4), the score of 10 may reflect genuinely excellent completions.
The ambiguous middle range (average in (2, 4]) requires re-evaluation because the fine-grained signal is mixed.
Decision Logic:
```python
# Abstract algorithm
from statistics import mean

def triage_score_10(completion):
    avg_rating = mean(completion.fine_grained_ratings)
    if avg_rating <= 2:
        return "flip_to_1"    # Clearly erroneous
    elif avg_rating <= 4:
        return "re_annotate"  # Ask GPT-4 again with max_tokens=1
    else:
        return "keep_as_10"   # Legitimately good
```
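As a usage sketch, the triage can be applied over all flagged completions. The record shape below is an assumption, and the triage logic is restated so the snippet runs standalone:

```python
from statistics import mean

def triage_score_10(ratings):
    # ratings: fine-grained aspect scores for one overall_score=10 completion
    avg = mean(ratings)
    if avg <= 2:
        return "flip_to_1"
    elif avg <= 4:
        return "re_annotate"
    return "keep_as_10"

# Hypothetical flagged records (aspect order: instruction_following,
# honesty, truthfulness, helpfulness)
flagged = [
    {"id": "a", "ratings": [1, 1, 2, 1]},  # avg 1.25 -> flip_to_1
    {"id": "b", "ratings": [3, 4, 3, 4]},  # avg 3.5  -> re_annotate
    {"id": "c", "ratings": [5, 5, 4, 5]},  # avg 4.75 -> keep_as_10
]
decisions = {r["id"]: triage_score_10(r["ratings"]) for r in flagged}
```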