# Principle: OpenBMB UltraFeedback Fine-Grained Preference Annotation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Preference_Learning |
| Last Updated | 2023-10-02 00:00 GMT |
## Overview
A multi-aspect preference annotation strategy that uses GPT-4 to rate model completions across four independent quality dimensions with rubric-guided evaluation.
## Description
Fine-Grained Preference Annotation is the second GPT-4 annotation pass in the UltraFeedback pipeline. Unlike the critique annotation (which produces a single holistic score), this pass evaluates each completion across four independent aspects:
- Instruction Following (1-5): Alignment between output and task intent, assessing goal understanding and constraint adherence.
- Honesty (1-5 or N/A): How well the model conveys uncertainty and calibrates confidence relative to correctness. Creative tasks receive N/A.
- Truthfulness (1-5): Accuracy assessment using a hallucination taxonomy with three types: factual errors, contradiction with the instruction/input, and self-contradiction or logical errors.
- Helpfulness (1-5): Overall informativeness and correctness, evaluating clarity, comprehensiveness, and conciseness.
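For concreteness, the four aspects can be sketched as a small configuration structure. This is purely illustrative; the names and fields are assumptions, not taken from the UltraFeedback codebase:

```python
# Hypothetical sketch of the four annotation aspects; field names are
# illustrative, not taken from the UltraFeedback codebase.
ASPECTS = {
    "instruction_following": {"scale": (1, 5), "allows_na": False},
    "honesty": {"scale": (1, 5), "allows_na": True},  # N/A for creative tasks
    "truthfulness": {
        "scale": (1, 5),
        "allows_na": False,
        "hallucination_types": [
            "factual error",
            "contradicts instruction/input",
            "self-contradictory or logical error",
        ],
    },
    "helpfulness": {"scale": (1, 5), "allows_na": False},
}
```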
A critical design feature is randomized completion ordering: for each aspect evaluation, the 4 completions are presented in a random permutation to mitigate position bias (GPT-4 tends to favor earlier entries). The SHUFFLE_NUM parameter controls how many random orderings are evaluated per aspect (default: 1).
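The effect of SHUFFLE_NUM can be sketched as follows, under the assumption that each shuffled pass yields one rating per completion; `rate_fn` is a stand-in for the GPT-4 aspect evaluation, not an actual API call:

```python
import random

def annotate_with_shuffles(completions, rate_fn, shuffle_num=1):
    """Average per-completion ratings over `shuffle_num` random orderings.

    `rate_fn` stands in for the GPT-4 aspect evaluation: it takes the
    completions in presentation order and returns one rating per item.
    """
    n = len(completions)
    sums = [0.0] * n
    for _ in range(shuffle_num):
        order = list(range(n))
        random.shuffle(order)  # new presentation order each pass
        ratings = rate_fn([completions[i] for i in order])
        for shuffled_pos, original_pos in enumerate(order):
            sums[original_pos] += ratings[shuffled_pos]
    return [s / shuffle_num for s in sums]
```

With the default `shuffle_num=1`, each completion is rated once in a single random order; larger values average out residual position effects at the cost of more annotation calls.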
Each aspect uses a detailed rubric template that defines the rating scale, provides format examples, and specifies the expected output structure. Truthfulness and helpfulness aspects additionally receive world knowledge context for subsets where ground truth is available.
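Assembling a single aspect evaluation prompt might then look like this. A minimal sketch: `build_aspect_prompt` is a hypothetical helper, not UltraFeedback's actual template code:

```python
def build_aspect_prompt(rubric, completion, world_knowledge=None):
    # Hypothetical helper: combine the aspect rubric, optional ground-truth
    # context (truthfulness/helpfulness subsets only), and the completion.
    parts = [rubric]
    if world_knowledge is not None:
        parts.append("World knowledge: " + world_knowledge)
    parts.append("Completion: " + completion)
    return "\n\n".join(parts)
```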
## Usage
Use this principle when you need multi-dimensional preference signals rather than a single overall score. The per-aspect ratings enable fine-grained reward modeling where different aspects can be weighted differently during training.
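As an illustration of aspect weighting, a scalar reward could be formed as a weighted average of the per-aspect ratings, skipping aspects marked N/A. This is a sketch; the weighting scheme and N/A handling are assumptions, not part of UltraFeedback:

```python
def weighted_reward(ratings, weights):
    # Weighted average over numeric aspect ratings; "N/A" aspects are
    # skipped and the remaining weights are renormalized.
    total = weight_sum = 0.0
    for aspect, w in weights.items():
        r = ratings.get(aspect)
        if isinstance(r, (int, float)):
            total += w * r
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

For example, with equal weights a completion rated `{"honesty": "N/A", "truthfulness": 5, "helpfulness": 4}` scores 4.5.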
## Theoretical Basis
The multi-aspect design is grounded in the observation that a single preference score conflates multiple quality dimensions. A response can be highly helpful but dishonest, or truthful but unhelpful. Decomposing the evaluation allows:
- Training reward models that distinguish between different failure modes
- Identifying specific weaknesses in model behavior
- Creating preference pairs that are informative along specific dimensions
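The last point can be sketched as pair construction along a single aspect. This is a hypothetical helper; the `min_gap` threshold is an assumption for illustration:

```python
def aspect_preference_pairs(completions, annotations, aspect, min_gap=1):
    # Build (chosen, rejected) pairs from completions whose ratings on
    # one aspect differ by at least `min_gap`; "N/A" ratings are skipped.
    pairs = []
    for i, ann_i in enumerate(annotations):
        for j, ann_j in enumerate(annotations):
            ri, rj = ann_i.get(aspect), ann_j.get(aspect)
            if isinstance(ri, int) and isinstance(rj, int) and ri - rj >= min_gap:
                pairs.append((completions[i], completions[j]))
    return pairs
```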
Annotation Schema:

```python
# For instruction_following and honesty aspects:
annotation = {
    "Rating": int,   # 1-5 (honesty may instead be "N/A")
    "Rationale": str,
}

# For truthfulness and helpfulness aspects:
annotation = {
    "Type": list,                 # List[int] or "None": hallucination/informativeness types
    "Rationale": str,             # rationale for type identification
    "Rating": int,                # 1-5 quality rating
    "Rationale For Rating": str,  # rationale for the rating
}
```
The randomized ordering mitigates GPT-4's known position bias:

```python
import random

# Generate a random permutation of the 4 completions
order = list(range(4))
random.shuffle(order)

# Present completions in shuffled order to GPT-4
# (completions[order[0]] first, and so on), then map each
# returned rating back to its original position via `order`.
```