# Heuristic: OpenBMB UltraFeedback Score 10 Anomaly Correction
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Annotation, Debugging |
| Last Updated | 2026-02-08 06:00 GMT |
## Overview
A cross-validation heuristic that uses fine-grained aspect scores to detect and correct anomalous `overall_score=10` ratings that should have been `1`.
## Description
The initial UltraFeedback dataset contained 2628 completions with `overall_score=10`. Many of these were parsing artifacts: GPT-4 had actually returned a score of 1, but the score-parsing logic, which split on "." (confusing the number 10 with a decimal point), misread it as 10. The correction heuristic uses the independently collected fine-grained scores (instruction_following, honesty, truthfulness, and helpfulness, each rated 1-5) as a cross-validation signal. If a completion has `overall_score=10` but a fine-grained average <= 2, it is clearly low quality and the score is flipped to 1. If the average is > 4, a score of 10 is plausible and is kept. For the ambiguous middle range (2, 4], the completion is re-annotated by GPT-4 with the original critique prepended to the prompt.
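The failure mode can be illustrated with a small sketch. This is a hypothetical reconstruction, not the actual UltraFeedback parsing code: a parser that strips "." characters turns a decimal-formatted reply of "1.0" into "10".

```python
def parse_score_buggy(response: str) -> int:
    # Hypothetical reconstruction of the bug: removing "." from a
    # decimal-formatted reply like "1.0" yields the string "10".
    return int(response.replace(".", ""))

def parse_score_fixed(response: str) -> float:
    # Safer sketch: parse the leading token as a float, so "1.0" -> 1.0
    # and "10" -> 10.0.
    return float(response.strip().split()[0].rstrip("."))

parse_score_buggy("1.0")  # 10, the anomaly
parse_score_fixed("1.0")  # 1.0
```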
## Usage
Use this heuristic when validating LLM-generated scores at scale. Always cross-validate scores from one evaluation dimension against independently-collected scores from other dimensions. Extreme scores (maximum or minimum) are especially suspicious and warrant automated validation.
## The Insight (Rule of Thumb)
- Action: For every completion with `overall_score == 10`, compare against the average of its fine-grained aspect scores.
- Value: Three tiers of action based on the fine-grained average:
  - Average <= 2: flip `overall_score` to 1 (clearly low quality).
  - Average in (2, 4]: re-annotate with GPT-4, using the original critique as context (ambiguous).
  - Average > 4: keep the 10 (plausibly high quality).
- Trade-off: Re-annotation costs additional GPT-4 API calls. The threshold values (2 and 4 on the 1-5 scale) are heuristic cutoffs chosen from the distribution of anomalous scores.
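The three-tier rule above can be sketched as a single function. The function name and return labels are illustrative; the thresholds (2 and 4 on the 1-5 aspect scale) are the ones stated above.

```python
def triage_overall_10(fine_grained_avg: float) -> str:
    """Decide what to do with a completion currently rated overall_score=10."""
    if fine_grained_avg <= 2:
        return "flip_to_1"    # clearly low quality: likely a parsing artifact
    elif fine_grained_avg <= 4:
        return "re_annotate"  # ambiguous: ask GPT-4 again with the critique
    else:
        return "keep_10"      # plausibly high quality

triage_overall_10(1.5)  # 'flip_to_1'
triage_overall_10(3.0)  # 're_annotate'
triage_overall_10(4.5)  # 'keep_10'
```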
## Reasoning
A completion with `overall_score=10` (the maximum) should consistently score well across all individual aspects. If the fine-grained average is <= 2 out of 5, it is statistically implausible that the overall score is genuinely 10; this is almost certainly a parsing error where a "1" was captured as "10". The middle range (2, 4] is ambiguous: the completion may be mediocre overall but have one strong aspect that justifies a moderate-to-high overall score. Re-annotation with the original critique gives GPT-4 additional context for producing a more accurate score, and the `max_tokens=1` setting constrains the output to just the numeric score.
## Code Evidence
Score checking logic from `fix_overall_score_issue.py:69-75`:

```python
def check_score(completion):
    if completion["fine-grained_score"] <= 2:
        return 2  # should flip
    elif completion["fine-grained_score"] <= 4:
        return 1  # re-annotate
    else:
        return 0  # remain
```
Fine-grained average calculation from `fix_overall_score_issue.py:64-66`:

```python
def calculate_average_rating(annotations):
    ratings = [int(aspect['Rating']) for aspect in annotations.values()
               if 'Rating' in aspect and aspect['Rating'] != "N/A"]
    return sum(ratings) / len(ratings) if ratings else None
```
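A quick sanity check of the averaging helper with made-up aspect ratings (the logic is repeated here so the example runs on its own; the sample values are illustrative, not from the dataset):

```python
def calculate_average_rating(annotations):
    # Same logic as the snippet above: skip missing or "N/A" ratings.
    ratings = [int(a["Rating"]) for a in annotations.values()
               if "Rating" in a and a["Rating"] != "N/A"]
    return sum(ratings) / len(ratings) if ratings else None

# Made-up fine-grained annotations for one completion.
sample = {
    "instruction_following": {"Rating": "1"},
    "honesty": {"Rating": "2"},
    "truthfulness": {"Rating": "1"},
    "helpfulness": {"Rating": "N/A"},  # skipped
}
avg = calculate_average_rating(sample)  # (1 + 2 + 1) / 3, about 1.33
```

An average this low (<= 2) puts an `overall_score=10` completion in the flip-to-1 tier.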
Correction logic from `fix_overall_score_issue.py:83-97`:

```python
if completion["overall_score"] == 10:
    flag = check_score(completion)
    count[flag] += 1
    if flag > 0:
        if flag == 2:
            completion["overall_score"] = 1
        elif flag == 1:
            # re-annotate
            custom_system_prompt = (
                completion["custom_system_prompt"]
                if completion["principle"] != "verbalized_calibration"
                else completion["custom_system_prompt"].split("For instance, ")[0].strip()
            )
            response = get_eval(
                "gpt-4-0613",
                system_prompt,
                feedback_prompt.format(
                    instruction="\n".join([example["instruction"],
                                           "Note: " + custom_system_prompt]),
                    completion=completion["response"],
                    critique=completion["critique"],
                ),
            )
            if "/" in response:
                response = response.split("/")[0].strip()
            score = float(eval(response.strip()))
            completion["overall_score"] = score
```
Re-annotation uses `max_tokens=1`, from `fix_overall_score_issue.py:49`:

```python
"max_tokens": 1,
```