Heuristic: OpenBMB UltraFeedback Score Parsing Robustness
| Knowledge Sources | |
|---|---|
| Domains | Annotation, LLMs, Debugging |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
Defensive parsing strategies for handling inconsistent GPT-4 score output formats, including "X/10" fractions and other multi-format responses.
Description
GPT-4 does not always produce scores in a clean, consistent format. When asked to provide a score from 1-10, it may respond with "7", "7/10", "7.5", or other variations. The UltraFeedback annotation pipeline includes multiple layers of defensive parsing to handle these inconsistencies. The critique annotator splits on `"\nOverall Score: "` and handles the "/" format. The score correction pipeline further handles fractional scores and uses `float(eval())` to parse numeric strings. The preference annotator uses regex patterns to extract numeric ratings from structured responses.
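The layered approach described above can be sketched as a single helper. This is a hypothetical `parse_score`, not code from the repository, but it handles the same variants ("7", "7/10", "7.5", prose-wrapped numbers):

```python
import re

def parse_score(raw: str) -> float:
    """Normalize the score formats GPT-4 actually emits: '7', '7/10', '7.5'.
    Hypothetical helper illustrating the layered defensive parsing."""
    text = raw.strip()
    # Layer 1: handle "X/Y" fraction notation by keeping the numerator.
    if "/" in text:
        text = text.split("/")[0].strip()
    # Layer 2: plain float conversion covers "7" and "7.5".
    try:
        return float(text)
    except ValueError:
        pass
    # Layer 3: regex fallback extracts the first number embedded in prose.
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match:
        return float(match.group())
    raise ValueError(f"no score found in {raw!r}")

print(parse_score("7/10"))            # 7.0
print(parse_score("Score: 8 of 10"))  # 8.0
```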
Usage
Use this heuristic whenever parsing numeric scores from LLM output. GPT-4 and other LLMs frequently deviate from the requested output format, even with explicit formatting instructions. Always implement defensive parsing with fallback strategies rather than assuming strict format compliance.
The Insight (Rule of Thumb)
- Action: Implement multi-layer score parsing: (1) split on known delimiters, (2) handle "X/Y" fraction format, (3) use regex to extract digits, (4) apply `eval()` or `float()` as final fallback.
- Value: At least 2628 completions out of the full dataset had the score=10 anomaly, demonstrating the scale of format inconsistency.
- Trade-off: Using `eval()` on LLM output is a security risk in untrusted contexts. The code uses `float(eval(response.strip()))` which could execute arbitrary Python if GPT-4 returned malicious strings. In this controlled context, GPT-4 output is trusted.
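For less trusted contexts, `ast.literal_eval` is a near drop-in replacement for the `eval()` fallback: it parses only Python literals, so injected code raises instead of executing. A sketch, not code from the repository:

```python
import ast

def safe_numeric(text: str) -> float:
    """Safer stand-in for float(eval(text)): ast.literal_eval accepts only
    Python literals, so strings like "__import__('os')" raise ValueError
    instead of running arbitrary code."""
    return float(ast.literal_eval(text.strip()))

print(safe_numeric("7"))      # 7.0
print(safe_numeric(" 7.5 "))  # 7.5
```

The numeric behavior is identical for the strings the pipeline actually sees; only the failure mode for hostile input changes.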
Reasoning
LLMs are stochastic text generators, not structured data producers. Even with explicit format instructions like "score from 1 to 10", the output can include explanatory text, fraction notation, or unexpected formatting. The UltraFeedback project discovered this when 2628 completions received `overall_score=10` that should have been `1`: the parsing logic split on ".", which confused "10" (the whole number) with a decimal point. The corrected code handles the "/" format explicitly and uses the fine-grained scores as a cross-validation signal.
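The cross-validation idea can be sketched as a hypothetical consistency check. The field names and the 1-5 fine-grained rating scale here are assumptions for illustration, not the project's actual schema:

```python
def flag_suspect_score(completion: dict) -> bool:
    """Hypothetical cross-check: a perfect overall score paired with low
    fine-grained ratings suggests the 10 came from a parsing error, not GPT-4.
    Field names and the 1-5 fine-grained scale are assumptions."""
    ratings = [float(a["Rating"]) for a in completion["annotations"]
               if a["Rating"] != "N/A"]
    if not ratings:
        return False
    mean_rating = sum(ratings) / len(ratings)
    # A perfect overall score with a mediocre fine-grained average is suspicious.
    return float(completion["overall_score"]) == 10 and mean_rating < 3
```

Flagged completions would then be routed back through the re-annotation pipeline rather than silently kept.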
Code Evidence
Score parsing with "/" handling from `annotate_critique.py:82-84`:

```python
critique, score = response[0].strip(), response[1].split(".")[0].strip()
example["completions"][i]["critique"] = critique
example["completions"][i]["overall_score"] = score if "/" not in score else float(eval(score.split("/")[0].strip()))
```
Score re-annotation parsing from `fix_overall_score_issue.py:94-96`:

```python
if "/" in response:
    response = response.split("/")[0].strip()
score = float(eval(response.strip()))
```
Regex-based rating extraction from `annotate_preference.py:24-29`:

```python
if aspect in ["instruction_following", "honesty"]:
    pattern = r"Rating: (.+?)\nRationale: (.+)"
    for response in responses:
        matches = re.search(pattern, response, re.DOTALL)
        annotation.append({
            "Rating": re.findall(r'\b\d+\b', matches.group(1))[0] if matches.group(1) != "N/A" else "N/A",
            ...
        })
```