# Heuristic: OpenBMB UltraFeedback Score 10 Anomaly Correction
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Annotation, Debugging |
| Last Updated | 2026-02-08 06:00 GMT |
## Overview
A cross-validation heuristic that uses fine-grained aspect scores to detect and correct anomalous `overall_score=10` ratings that should have been `1`.
## Description
The initial UltraFeedback dataset contained 2628 completions with `overall_score=10`. Many of these were parsing artifacts: GPT-4 had actually returned a score of 1, but the score-parsing logic, which split on "." (confusing the number 10 with a decimal point), misread it as 10. The correction heuristic uses the independently collected fine-grained scores (instruction_following, honesty, truthfulness, and helpfulness, each rated 1-5) as a cross-validation signal. If a completion has `overall_score=10` but a fine-grained average <= 2, it is clearly low quality and the score is flipped to 1. If the average is > 4, a score of 10 is plausible and is kept. For the ambiguous middle range (2, 4], the completion is re-annotated by GPT-4 with the original critique prepended to the prompt.
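The failure mode can be illustrated with a small sketch. This is a hypothetical reconstruction, not the actual UltraFeedback parsing code: a parser that strips "." characters turns a decimal-formatted reply of "1.0" into "10".

```python
def parse_score_buggy(response: str) -> int:
    # Hypothetical reconstruction of the bug: removing "." from a
    # decimal-formatted reply like "1.0" yields the string "10".
    return int(response.replace(".", ""))

def parse_score_fixed(response: str) -> float:
    # Safer sketch: parse the leading token as a float, so "1.0" -> 1.0
    # and "10" -> 10.0.
    return float(response.strip().split()[0].rstrip("."))

parse_score_buggy("1.0")  # 10, the anomaly
parse_score_fixed("1.0")  # 1.0
```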
## Usage
Use this heuristic when validating LLM-generated scores at scale. Always cross-validate scores from one evaluation dimension against independently-collected scores from other dimensions. Extreme scores (maximum or minimum) are especially suspicious and warrant automated validation.
## The Insight (Rule of Thumb)
- Action: For every completion with `overall_score == 10`, compare against the average of its fine-grained aspect scores.
- Value: Three tiers of action based on the fine-grained average:
  - Average <= 2: flip `overall_score` to 1 (clearly low quality).
  - Average in (2, 4]: re-annotate with GPT-4, using the original critique as context (ambiguous).
  - Average > 4: keep the 10 (plausibly high quality).
- Trade-off: Re-annotation costs additional GPT-4 API calls. The threshold values (2 and 4 on the 1-5 scale) are heuristic cutoffs chosen from the distribution of anomalous scores.
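The three-tier rule above can be sketched as a single function. The function name and return labels are illustrative; the thresholds (2 and 4 on the 1-5 aspect scale) are the ones stated above.

```python
def triage_overall_10(fine_grained_avg: float) -> str:
    """Decide what to do with a completion currently rated overall_score=10."""
    if fine_grained_avg <= 2:
        return "flip_to_1"    # clearly low quality: likely a parsing artifact
    elif fine_grained_avg <= 4:
        return "re_annotate"  # ambiguous: ask GPT-4 again with the critique
    else:
        return "keep_10"      # plausibly high quality

triage_overall_10(1.5)  # 'flip_to_1'
triage_overall_10(3.0)  # 're_annotate'
triage_overall_10(4.5)  # 'keep_10'
```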
## Reasoning
A completion with `overall_score=10` (the maximum) should consistently score well across all individual aspects. If the fine-grained average is <= 2 out of 5, it is statistically implausible that the overall score is genuinely 10; this is almost certainly a parsing error where a "1" was captured as "10". The middle range (2, 4] is ambiguous: the completion may be mediocre overall but have one strong aspect that justifies a moderate-to-high overall score. Re-annotation with the original critique gives GPT-4 additional context for producing a more accurate score, and the `max_tokens=1` setting constrains the output to just the numeric score.
## Code Evidence
Score checking logic from `fix_overall_score_issue.py:69-75`:

```python
def check_score(completion):
    if completion["fine-grained_score"] <= 2:
        return 2  # should flip
    elif completion["fine-grained_score"] <= 4:
        return 1  # re-annotate
    else:
        return 0  # remain
```
Fine-grained average calculation from `fix_overall_score_issue.py:64-66`:

```python
def calculate_average_rating(annotations):
    ratings = [int(aspect['Rating']) for aspect in annotations.values()
               if 'Rating' in aspect and aspect['Rating'] != "N/A"]
    return sum(ratings) / len(ratings) if ratings else None
```
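A quick sanity check of the averaging helper with made-up aspect ratings (the logic is repeated here so the example runs on its own; the sample values are illustrative, not from the dataset):

```python
def calculate_average_rating(annotations):
    # Same logic as the snippet above: skip missing or "N/A" ratings.
    ratings = [int(a["Rating"]) for a in annotations.values()
               if "Rating" in a and a["Rating"] != "N/A"]
    return sum(ratings) / len(ratings) if ratings else None

# Made-up fine-grained annotations for one completion.
sample = {
    "instruction_following": {"Rating": "1"},
    "honesty": {"Rating": "2"},
    "truthfulness": {"Rating": "1"},
    "helpfulness": {"Rating": "N/A"},  # skipped
}
avg = calculate_average_rating(sample)  # (1 + 2 + 1) / 3, about 1.33
```

An average this low (<= 2) puts an `overall_score=10` completion in the flip-to-1 tier.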
Correction logic from `fix_overall_score_issue.py:83-97`:

```python
if completion["overall_score"] == 10:
    flag = check_score(completion)
    count[flag] += 1
    if flag > 0:
        if flag == 2:
            completion["overall_score"] = 1
        elif flag == 1:
            # re-annotate
            custom_system_prompt = (
                completion["custom_system_prompt"]
                if completion["principle"] != "verbalized_calibration"
                else completion["custom_system_prompt"].split("For instance, ")[0].strip()
            )
            response = get_eval(
                "gpt-4-0613",
                system_prompt,
                feedback_prompt.format(
                    instruction="\n".join([example["instruction"],
                                           "Note: " + custom_system_prompt]),
                    completion=completion["response"],
                    critique=completion["critique"],
                ),
            )
            if "/" in response:
                response = response.split("/")[0].strip()
            score = float(eval(response.strip()))
            completion["overall_score"] = score
```
Re-annotation uses `max_tokens=1`, from `fix_overall_score_issue.py:49`:

```python
"max_tokens": 1,
```