Principle: OpenBMB UltraFeedback Score Validation and Correction
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Quality, Preference_Learning |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A post-hoc data quality correction strategy that identifies and remediates anomalous overall_score=10 ratings by cross-referencing with fine-grained aspect ratings.
Description
Score Validation and Correction addresses a specific data quality issue discovered in the UltraFeedback dataset: 2,628 completions received an overall_score of 10, which falls outside the documented single-digit score range (likely caused by GPT-4 outputting "10" when the score parsing logic expected a single digit, or by other format inconsistencies).
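A minimal sketch of the hypothesized failure mode (the actual parsing code is not shown here, so both functions are illustrative assumptions): a parser that accepts any integer lets an off-format "10" through, while a single-digit range check would reject it.

```python
def parse_overall_score_lenient(reply: str) -> int:
    # Hypothetical lenient parser: trusts int(), so an off-format
    # reply of "10" is stored as a score of 10.
    return int(reply.strip())

def parse_overall_score_strict(reply: str):
    # Hypothetical strict variant: accepts only a single digit 1-9,
    # returning None for out-of-format replies like "10".
    reply = reply.strip()
    if len(reply) == 1 and reply.isdigit() and reply != "0":
        return int(reply)
    return None
```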
The correction strategy uses the fine-grained aspect ratings (from the preference annotation pass) as a cross-reference signal. For each completion with overall_score=10, the pipeline:
- Calculates the average fine-grained rating across all annotated aspects (instruction_following, honesty, truthfulness, helpfulness)
- Applies a three-way triage based on the average:
- Fine-grained average ≤ 2 → Flip to 1 (clearly bad completion; the score of 10 was erroneous)
- Fine-grained average in (2, 4] → Re-annotate via GPT-4 (ambiguous case; ask GPT-4 for a single-digit score)
- Fine-grained average > 4 → Keep as 10 (legitimately good completion)
The re-annotation uses a modified feedback prompt that includes the original critique and asks GPT-4 for just a single-token score (max_tokens=1), providing a more reliable assessment.
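A hedged sketch of this re-annotation step (the prompt wording, function names, and fields below are assumptions for illustration, not the pipeline's exact code):

```python
def build_reannotation_prompt(instruction: str, completion: str, critique: str) -> str:
    # Modified feedback prompt: includes the original critique and
    # constrains the reply to one digit (decoded with max_tokens=1).
    return (
        f"Instruction: {instruction}\n"
        f"Completion: {completion}\n"
        f"Original critique: {critique}\n"
        "Given the critique above, rate the overall quality of the "
        "completion with a single digit from 1 to 9. Reply with the digit only."
    )

def parse_single_token_score(reply: str):
    # With max_tokens=1 the reply should be exactly one digit;
    # anything else is left unresolved rather than guessed.
    reply = reply.strip()
    if len(reply) == 1 and reply.isdigit() and reply != "0":
        return int(reply)
    return None
```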
Usage
Use this principle when you have multi-signal annotation data and discover systematic scoring anomalies. The cross-referencing approach leverages independent annotation dimensions to validate questionable holistic scores.
Theoretical Basis
The correction leverages the consistency assumption: if multiple independent evaluation dimensions rate a completion poorly (average ≤ 2), the holistic score of 10 is almost certainly erroneous. Conversely, if the fine-grained ratings are high (> 4), the score of 10 may reflect genuinely excellent completions.
The ambiguous middle range (average in (2, 4]) requires re-evaluation because the fine-grained signal is mixed.
Decision Logic:
```python
# Abstract algorithm
from statistics import mean

def triage_score_10(completion):
    avg_rating = mean(completion.fine_grained_ratings)
    if avg_rating <= 2:
        return "flip_to_1"    # Clearly erroneous
    elif avg_rating <= 4:
        return "re_annotate"  # Ask GPT-4 again with max_tokens=1
    else:
        return "keep_as_10"   # Legitimately good
```
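As a usage sketch, the triage can be applied over all flagged completions. The record shape below is an assumption, and the triage logic is restated so the snippet runs standalone:

```python
from statistics import mean

def triage_score_10(ratings):
    # ratings: fine-grained aspect scores for one overall_score=10 completion
    avg = mean(ratings)
    if avg <= 2:
        return "flip_to_1"
    elif avg <= 4:
        return "re_annotate"
    return "keep_as_10"

# Hypothetical flagged records (aspect order: instruction_following,
# honesty, truthfulness, helpfulness)
flagged = [
    {"id": "a", "ratings": [1, 1, 2, 1]},  # avg 1.25 -> flip_to_1
    {"id": "b", "ratings": [3, 4, 3, 4]},  # avg 3.5  -> re_annotate
    {"id": "c", "ratings": [5, 5, 4, 5]},  # avg 4.75 -> keep_as_10
]
decisions = {r["id"]: triage_score_10(r["ratings"]) for r in flagged}
```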