
Heuristic:OpenBMB UltraFeedback Score Parsing Robustness

From Leeroopedia



Knowledge Sources
Domains Annotation, LLMs, Debugging
Last Updated 2026-02-08 06:00 GMT

Overview

Defensive parsing strategies for handling inconsistent GPT-4 score output formats, including "X/10" fractions and mixed-format responses.

Description

GPT-4 does not always produce scores in a clean, consistent format. When asked to provide a score from 1 to 10, it may respond with "7", "7/10", "7.5", or other variations. The UltraFeedback annotation pipeline includes multiple layers of defensive parsing to handle these inconsistencies. The critique annotator splits on `"\nOverall Score: "` and handles the "/" format. The score correction pipeline further handles fractional scores and uses `float(eval(...))` to parse numeric strings. The preference annotator uses regex patterns to extract numeric ratings from structured responses.

Usage

Use this heuristic whenever parsing numeric scores from LLM output. GPT-4 and other LLMs frequently deviate from the requested output format, even with explicit formatting instructions. Always implement defensive parsing with fallback strategies rather than assuming strict format compliance.

The Insight (Rule of Thumb)

  • Action: Implement multi-layer score parsing: (1) split on known delimiters, (2) handle "X/Y" fraction format, (3) use regex to extract digits, (4) apply `eval()` or `float()` as final fallback.
  • Value: At least 2628 completions in the full dataset exhibited the `score=10` anomaly, demonstrating the scale of format inconsistency.
  • Trade-off: Using `eval()` on LLM output is a security risk in untrusted contexts. The code uses `float(eval(response.strip()))` which could execute arbitrary Python if GPT-4 returned malicious strings. In this controlled context, GPT-4 output is trusted.

Reasoning

LLMs are stochastic text generators, not structured data producers. Even with explicit format instructions like "score from 1 to 10", the output can include explanatory text, fraction notation, or unexpected formatting. The UltraFeedback project discovered this when 2628 completions received `overall_score=10` that should have been `1` — the parsing logic was splitting on "." which confused "10" (the number) with decimal points. The corrected code handles the "/" format explicitly and uses fine-grained scores as a cross-validation signal.

Code Evidence

Score parsing with "/" handling from `annotate_critique.py:82-84`:

critique, score = response[0].strip(), response[1].split(".")[0].strip()
example["completions"][i]["critique"] = critique
example["completions"][i]["overall_score"] = score if "/" not in score else float(eval(score.split("/")[0].strip()))

Score re-annotation parsing from `fix_overall_score_issue.py:94-96`:

if "/" in response:
    response = response.split("/")[0].strip()
score = float(eval(response.strip()))
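
The same fraction handling can be reproduced without `eval` by parsing the numeric literal with `ast.literal_eval`, which rejects arbitrary expressions. This is a hedged substitution for untrusted contexts, not the project's actual code:

```python
import ast


def reannotate_score(response: str) -> float:
    # Mirror the "/" handling above, but parse the numeric literal
    # with ast.literal_eval instead of eval so that arbitrary Python
    # in the model output cannot execute.
    if "/" in response:
        response = response.split("/")[0].strip()
    return float(ast.literal_eval(response.strip()))
```

`literal_eval` raises `ValueError` on non-literal input, so malformed responses fail loudly instead of executing.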

Regex-based rating extraction from `annotate_preference.py:24-29`:

if aspect in ["instruction_following", "honesty"]:
    pattern = r"Rating: (.+?)\nRationale: (.+)"
    for response in responses:
        matches = re.search(pattern, response, re.DOTALL)
        annotation.append({
            "Rating": re.findall(r'\b\d+\b', matches.group(1))[0] if matches.group(1) != "N/A" else "N/A",
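
The pattern can be exercised on a representative structured response (the response text below is fabricated for illustration; the extraction logic mirrors the snippet above):

```python
import re

pattern = r"Rating: (.+?)\nRationale: (.+)"
response = "Rating: 4\nRationale: Mostly follows the instruction."
matches = re.search(pattern, response, re.DOTALL)
# Extract the first standalone number unless the model answered "N/A".
rating = (re.findall(r"\b\d+\b", matches.group(1))[0]
          if matches.group(1) != "N/A" else "N/A")
```

Here `rating` is the string `"4"`; the inner `\b\d+\b` pass guards against ratings wrapped in extra prose such as "4 (good)".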
