Principle:Openai Evals Solver Output Postprocessing
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Text Normalization, Pipeline Design |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
A text normalization pipeline applied to raw solver output before evaluation scoring, ensuring consistent formatting for reliable comparison against expected answers.
Description
Solver Output Postprocessing addresses a fundamental challenge in automated evaluation: raw model output is noisy. Models may wrap answers in quotation marks, append trailing periods, or add leading or trailing whitespace and other formatting artefacts that are semantically irrelevant but cause exact-match comparisons to fail. Postprocessing eliminates this noise by applying a deterministic sequence of text transformations.
The postprocessing pipeline is implemented as a chain of small processing steps. Because every step always runs, it is closer to a pipes-and-filters pipeline than to a classic chain of responsibility, where a handler may stop the chain. Each postprocessor is a small, single-purpose class that performs one specific text transformation. Postprocessors are composed in order and applied sequentially to the solver output. The output of one postprocessor becomes the input of the next, forming a pipeline from raw model text to evaluation-ready text.
The three core postprocessors are:
- Strip -- removes leading and trailing whitespace characters from the output.
- RemoveQuotes -- strips surrounding quotation marks (single or double) that models often add around their answers.
- RemovePeriod -- removes trailing periods that models frequently append to short answers.
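A minimal sketch of the three transformations, assuming each postprocessor is a callable that takes and returns a plain string. The class bodies below are illustrative assumptions, not the library's actual implementation, which may operate on a result object rather than a bare `str`.

```python
class Strip:
    """Remove leading and trailing whitespace."""

    def __call__(self, text: str) -> str:
        return text.strip()


class RemoveQuotes:
    """Remove one pair of matching surrounding quotes, if present."""

    def __call__(self, text: str) -> str:
        # Assumption: only a matching pair at the very start and end counts.
        if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
            return text[1:-1]
        return text


class RemovePeriod:
    """Remove trailing periods."""

    def __call__(self, text: str) -> str:
        return text.rstrip(".")
```

Keeping each class to a single transformation makes every step easy to test in isolation and to reorder in configuration.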
Postprocessors are specified as class path strings in solver YAML configuration files. This design allows new postprocessors to be added without modifying existing code; any Python class that implements the postprocessor interface can be referenced by its fully qualified path.
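One way such class path strings could be resolved is with the standard library's `importlib`. The helper name `load_class` is hypothetical, and the snippet assumes the `module:ClassName` format shown in the configuration; the framework's own loader may differ.

```python
import importlib


def load_class(path: str):
    """Resolve a 'package.module:ClassName' string to the class object."""
    module_name, _, class_name = path.partition(":")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module:Class', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the snippet runs anywhere.
OrderedDict = load_class("collections:OrderedDict")
```

Because resolution happens at load time from a string, adding a new postprocessor only requires that its module be importable; no registry or existing code needs to change.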
The Solver base class is responsible for applying the postprocessor chain. After the solver's internal completion logic returns a raw result, the base class iterates through the configured postprocessors and applies each one in sequence before returning the final output to the evaluation framework.
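The division of labour between base class and subclass can be sketched as a template method. All names here (`BaseSolver`, `_solve`, `EchoSolver`) are hypothetical, and the sketch passes plain strings where the real framework may pass richer task and result objects.

```python
from abc import ABC, abstractmethod
from typing import Callable, Sequence


class BaseSolver(ABC):
    """Hypothetical base class: subclasses generate, the base class normalizes."""

    def __init__(self, postprocessors: Sequence[Callable[[str], str]] = ()):
        self.postprocessors = list(postprocessors)

    @abstractmethod
    def _solve(self, prompt: str) -> str:
        """Return the raw, unnormalized completion."""

    def __call__(self, prompt: str) -> str:
        output = self._solve(prompt)
        # Apply each configured postprocessor in order.
        for postprocess in self.postprocessors:
            output = postprocess(output)
        return output


class EchoSolver(BaseSolver):
    """Stand-in for a real model call; returns a noisy fixed answer."""

    def _solve(self, prompt: str) -> str:
        return "  Paris.  "


solver = EchoSolver(postprocessors=[str.strip, lambda s: s.rstrip(".")])
print(solver("Capital of France?"))  # prints: Paris
```

Subclasses only override `_solve`, so every solver gets the same normalization behaviour without repeating the loop.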
Usage
Apply output postprocessing in the following scenarios:
- Exact-match evaluations where even minor formatting differences cause false negatives.
- Multiple-choice tasks where model answers like "A." must be normalized to A.
- Short-answer tasks where whitespace and punctuation artefacts are common.
Postprocessors are configured in the solver YAML specification:
    solver:
      class: evals.solvers.openai_solver:OpenAISolver
      args:
        model: gpt-4
        postprocessors:
          - evals.solvers.postprocessors:Strip
          - evals.solvers.postprocessors:RemoveQuotes
          - evals.solvers.postprocessors:RemovePeriod
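The order matters: stripping whitespace before removing quotes ensures that ` "hello" ` (quoted, with a leading and trailing space) is first trimmed to `"hello"` and then to `hello`. Reversing the order would leave the quotation marks in place, because a quote-removal step that looks only at the first and last characters finds spaces there, not quotes.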
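The ordering effect can be checked with two small stand-in functions. They are assumptions for illustration only, and `remove_quotes` is assumed to look only at the first and last characters.

```python
def strip(text: str) -> str:
    return text.strip()


def remove_quotes(text: str) -> str:
    # Assumption: only a matching pair at the very start and end is removed.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    return text


def apply_all(text, steps):
    """Apply each step in order, feeding each result into the next step."""
    for step in steps:
        text = step(text)
    return text


raw = ' "hello" '
print(repr(apply_all(raw, [strip, remove_quotes])))  # 'hello'
print(repr(apply_all(raw, [remove_quotes, strip])))  # '"hello"'
```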
Theoretical Basis
The theoretical foundation is rooted in canonical form reduction. In formal language theory, two strings are considered equivalent if they reduce to the same canonical form under a defined set of transformations. The postprocessor pipeline defines such a transformation set for evaluation purposes.
The algorithm proceeds as follows:
1. The solver produces a raw output string: raw_output.
2. Load the postprocessor classes from the YAML config (an ordered list of class paths).
3. Set output = raw_output. For each postprocessor P in the ordered list:
   a. Apply P.transform(output) to produce cleaned_output.
   b. Set output = cleaned_output.
4. Return the final output to the evaluation framework for scoring. If no postprocessors are configured, this is the raw output unchanged.
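The steps above amount to a left fold of the transforms over the raw string, and the canonical-form idea becomes an equality check on the folded results. The sketch below uses stand-in transforms and hypothetical names (`canonicalize`, `equivalent`, `PIPELINE`); it is not the framework's own code.

```python
from functools import reduce
from typing import Callable, Iterable

Transform = Callable[[str], str]


def canonicalize(raw_output: str, transforms: Iterable[Transform]) -> str:
    """Fold the transforms over the string, left to right."""
    return reduce(lambda text, transform: transform(text), transforms, raw_output)


# Stand-in transforms (assumptions, not the library's implementations):
# strip whitespace, drop one surrounding quote pair, drop trailing periods.
PIPELINE = [
    str.strip,
    lambda s: s[1:-1] if len(s) >= 2 and s[0] == s[-1] and s[0] in "\"'" else s,
    lambda s: s.rstrip("."),
]


def equivalent(a: str, b: str) -> bool:
    """Two outputs are equivalent if they share a canonical form."""
    return canonicalize(a, PIPELINE) == canonicalize(b, PIPELINE)
```

With these stand-ins, `equivalent(' "A." ', "A")` holds while `equivalent("A", "B")` does not, which is the behaviour an exact-match scorer relies on.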
This design follows the principle of separation of concerns: the solver is responsible only for generating output, the postprocessors are responsible only for normalization, and the evaluation framework is responsible only for scoring. No single component needs to understand the others' logic.