Heuristic: OpenBMB UltraFeedback Position Bias Mitigation
| Knowledge Sources | |
|---|---|
| Domains | Annotation, LLMs, Evaluation |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
Randomized completion ordering technique to mitigate GPT-4's position bias when rating multiple completions.
Description
When GPT-4 evaluates multiple text completions side-by-side, it can exhibit position bias, systematically favoring or penalizing completions based on their presentation order rather than their actual quality. The UltraFeedback annotation pipeline mitigates this by generating multiple random permutations of the 4 completions and presenting each permutation to GPT-4 for independent evaluation. The `SHUFFLE_NUM` parameter controls how many distinct orderings are evaluated per instruction-aspect pair.
Usage
Use this heuristic whenever designing side-by-side LLM evaluation pipelines where a judge model rates multiple completions simultaneously. It is especially relevant for the GPT4_Preference_Annotator implementation, which evaluates 4 completions across 4 aspects (instruction_following, honesty, truthfulness, helpfulness). Without shuffling, ratings may be biased toward earlier or later positions.
The Insight (Rule of Thumb)
- Action: Shuffle the order of completions presented to the judge model and evaluate multiple random orderings.
- Value: `SHUFFLE_NUM = 1` is the current setting (single random ordering per aspect). Higher values (e.g., 3-5) would provide more robust debiasing at the cost of additional API calls.
- Trade-off: Each additional shuffle multiplies GPT-4 API calls and cost. With 4 aspects and `SHUFFLE_NUM = 1`, there are 4 API calls per instruction; `SHUFFLE_NUM = 3` would triple this to 12.
- Implementation: Random permutations are generated ensuring no duplicates, then completions are reordered before presenting to GPT-4. Annotations are mapped back to the original completion indices after evaluation.
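The duplicate-free permutation sampling described above can be sketched compactly. This is an illustrative alternative to the pipeline's rejection loop (it enumerates all orderings and samples without replacement), not the pipeline's own code:

```python
import itertools
import random

def sample_orderings(n_completions: int, shuffle_num: int) -> list[list[int]]:
    """Draw `shuffle_num` distinct random orderings of n_completions items.

    Enumerating all permutations is cheap for 4 completions (24 orderings);
    for larger n, a rejection loop like the one in the source scales better.
    """
    all_orders = [list(p) for p in itertools.permutations(range(n_completions))]
    return random.sample(all_orders, shuffle_num)
```

For `n_completions = 4`, `shuffle_num` can be at most 24, the number of distinct orderings.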
Reasoning
Position bias in LLM-as-judge evaluations is a well-documented phenomenon. Models tend to assign higher scores to earlier items in a list (primacy bias) or later items (recency bias). By randomizing the presentation order, the bias is distributed uniformly across completions rather than systematically favoring specific ones. The annotation mapping (`completions[j]["annotations"][aspect].append(responses[order.index(j)])`) correctly remaps shuffled results back to the original completion positions, ensuring each completion accumulates ratings from its randomized position.
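The remapping logic can be sanity-checked in isolation. The ratings below are hypothetical stand-ins for judge responses; the round trip holds for any permutation because `order[order.index(j)] == j`:

```python
import random

# Hypothetical ratings, indexed by original completion position.
true_ratings = [5, 2, 4, 1]

order = list(range(4))
random.shuffle(order)

# The judge sees completions in shuffled order, so its responses are
# indexed by presentation slot: slot i shows completion order[i].
responses = [true_ratings[o] for o in order]

# Remap: the rating for original completion j sits at slot order.index(j).
recovered = [responses[order.index(j)] for j in range(4)]
assert recovered == true_ratings
```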
Code Evidence
SHUFFLE_NUM constant and shuffle generation from `annotate_preference.py:77,96-105` (note the constant is spelled `SHUFLLE_NUM` in the source):

```python
SHUFLLE_NUM = 1
# ...
count = 0
random_orders = []
while True:
    order = list(range(4))
    random.shuffle(order)
    if order not in random_orders:
        random_orders.append(order)
        count += 1
    if count == SHUFLLE_NUM:
        break
```
Shuffled input formatting from `annotate_preference.py:107-109`:

```python
for order in random_orders:
    format_input = {"instruction": example["instruction"]}
    format_input.update({f"text_{i+1}": example["completions"][o]["response"] for i, o in enumerate(order)})
```
Annotation remapping from `annotate_preference.py:124-125`:

```python
for j in range(4):
    completions[j]["annotations"][aspect].append(responses[order.index(j)])
```
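Because each ordering appends one rating per completion, raising `SHUFFLE_NUM` above 1 leaves each completion with a list of ratings to aggregate. A minimal averaging sketch over the structure shown in the evidence above (the aggregation step itself is an assumption, not quoted from the source):

```python
def aggregate_annotations(completions, aspect):
    """Average the ratings each completion accumulated across shuffled orderings.

    `completions` mirrors the evidence above: each item has
    annotations[aspect] -> list of ratings, one per evaluated ordering.
    """
    return [
        sum(c["annotations"][aspect]) / len(c["annotations"][aspect])
        for c in completions
    ]
```

With `SHUFFLE_NUM = 1` each list holds a single rating and the average is the rating itself.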