Heuristic: OpenBMB UltraFeedback Position Bias Mitigation
| Knowledge Sources | |
|---|---|
| Domains | Annotation, LLMs, Evaluation |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
Randomized completion ordering technique to mitigate GPT-4's position bias when rating multiple completions.
Description
When GPT-4 evaluates multiple text completions side-by-side, it can exhibit position bias, systematically favoring or penalizing completions based on their presentation order rather than their actual quality. The UltraFeedback annotation pipeline mitigates this by generating multiple random permutations of the 4 completions and presenting each permutation to GPT-4 for independent evaluation. The `SHUFFLE_NUM` parameter controls how many distinct orderings are evaluated per instruction-aspect pair.
Usage
Use this heuristic whenever designing side-by-side LLM evaluation pipelines where a judge model rates multiple completions simultaneously. It is especially relevant for the GPT4_Preference_Annotator implementation, which evaluates 4 completions across 4 aspects (instruction_following, honesty, truthfulness, helpfulness). Without shuffling, ratings may be biased toward earlier or later positions.
The Insight (Rule of Thumb)
- Action: Shuffle the order of completions presented to the judge model and evaluate multiple random orderings.
- Value: `SHUFFLE_NUM = 1` is the current setting (single random ordering per aspect). Higher values (e.g., 3-5) would provide more robust debiasing at the cost of additional API calls.
- Trade-off: Each additional shuffle multiplies GPT-4 API calls and cost. With 4 aspects and `SHUFFLE_NUM = 1`, there are 4 API calls per instruction; `SHUFFLE_NUM = 3` would triple this to 12.
- Implementation: Random permutations are generated ensuring no duplicates, then completions are reordered before presenting to GPT-4. Annotations are mapped back to the original completion indices after evaluation.
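The duplicate-free permutation sampling described above can be sketched compactly. This is an illustrative alternative to the pipeline's rejection loop (it enumerates all orderings and samples without replacement), not the pipeline's own code:

```python
import itertools
import random

def sample_orderings(n_completions: int, shuffle_num: int) -> list[list[int]]:
    """Draw `shuffle_num` distinct random orderings of n_completions items.

    Enumerating all permutations is cheap for 4 completions (24 orderings);
    for larger n, a rejection loop like the one in the source scales better.
    """
    all_orders = [list(p) for p in itertools.permutations(range(n_completions))]
    return random.sample(all_orders, shuffle_num)
```

For `n_completions = 4`, `shuffle_num` can be at most 24, the number of distinct orderings.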
Reasoning
Position bias in LLM-as-judge evaluations is a well-documented phenomenon. Models tend to assign higher scores to earlier items in a list (primacy bias) or later items (recency bias). By randomizing the presentation order, the bias is distributed uniformly across completions rather than systematically favoring specific ones. The annotation mapping (`completions[j]["annotations"][aspect].append(responses[order.index(j)])`) correctly remaps shuffled results back to the original completion positions, ensuring each completion accumulates ratings from its randomized position.
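The remapping logic can be sanity-checked in isolation. The ratings below are hypothetical stand-ins for judge responses; the round trip holds for any permutation because `order[order.index(j)] == j`:

```python
import random

# Hypothetical ratings, indexed by original completion position.
true_ratings = [5, 2, 4, 1]

order = list(range(4))
random.shuffle(order)

# The judge sees completions in shuffled order, so its responses are
# indexed by presentation slot: slot i shows completion order[i].
responses = [true_ratings[o] for o in order]

# Remap: the rating for original completion j sits at slot order.index(j).
recovered = [responses[order.index(j)] for j in range(4)]
assert recovered == true_ratings
```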
Code Evidence
SHUFFLE_NUM constant and shuffle generation from `annotate_preference.py:77,96-105` (note the constant is spelled `SHUFLLE_NUM` in the source):

```python
SHUFLLE_NUM = 1
# ...
count = 0
random_orders = []
while True:
    order = list(range(4))
    random.shuffle(order)
    if order not in random_orders:
        random_orders.append(order)
        count += 1
    if count == SHUFLLE_NUM:
        break
```
Shuffled input formatting from `annotate_preference.py:107-109`:

```python
for order in random_orders:
    format_input = {"instruction": example["instruction"]}
    format_input.update({f"text_{i+1}": example["completions"][o]["response"] for i, o in enumerate(order)})
```
Annotation remapping from `annotate_preference.py:124-125`:

```python
for j in range(4):
    completions[j]["annotations"][aspect].append(responses[order.index(j)])
```
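Because each ordering appends one rating per completion, raising `SHUFFLE_NUM` above 1 leaves each completion with a list of ratings to aggregate. A minimal averaging sketch over the structure shown in the evidence above (the aggregation step itself is an assumption, not quoted from the source):

```python
def aggregate_annotations(completions, aspect):
    """Average the ratings each completion accumulated across shuffled orderings.

    `completions` mirrors the evidence above: each item has
    annotations[aspect] -> list of ratings, one per evaluated ordering.
    """
    return [
        sum(c["annotations"][aspect]) / len(c["annotations"][aspect])
        for c in completions
    ]
```

With `SHUFFLE_NUM = 1` each list holds a single rating and the average is the rating itself.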