Principle:Openai Evals Solver Configuration Patterns

Knowledge Sources	Openai_Evals
Domains	Evaluation, Configuration Management, Benchmarking
Last Updated	2026-02-14 10:00 GMT

Overview

Solver configuration patterns are a configuration-driven approach to organizing complex solver compositions via YAML-based registration. They enable systematic benchmarking across multiple models, datasets, and reasoning strategies without code changes.

Description

Solver Configuration Patterns provide a declarative, code-free mechanism for defining and composing evaluation solvers. Rather than writing Python code for each combination of model, reasoning strategy, and dataset, researchers specify solver configurations in YAML files that are registered with the evaluation framework. This separation of configuration from implementation enables rapid experimentation and systematic benchmarking.

The YAML configuration system supports nested solver composition, where complex solver behaviours are built by layering simpler components:

Simple solvers directly call a language model with the prompt and return the response. These form the base layer of any composition.
Chain-of-thought (CoT) solvers compose two inner solvers: a reasoning solver that generates step-by-step thinking, and an extraction solver that reads the reasoning and produces a concise final answer.
HHH-wrapped solvers add helpful, honest, and harmless (HHH) alignment context around an inner solver, conditioning completion models to produce well-behaved responses.
Self-consistency solvers run the same prompt through a solver multiple times and aggregate the results (e.g., by majority vote) to produce a more reliable answer.
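
The layering described above can be illustrated with a minimal Python sketch. The Toy* classes, the fake model functions, and the HHH preamble text are all hypothetical stand-ins written for this page; they are not the Openai_Evals implementations and only mirror the roles each solver type plays.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


class ToySimpleSolver:
    """Leaf solver: passes the prompt to a model callable and returns its reply."""

    def __init__(self, model_fn: Callable[[str], str]):
        self.model_fn = model_fn

    def solve(self, prompt: str) -> str:
        return self.model_fn(prompt)


class ToyCoTSolver:
    """Runs a reasoning solver, then an extraction solver that reads the reasoning."""

    def __init__(self, cot_solver, extract_solver):
        self.cot_solver = cot_solver
        self.extract_solver = extract_solver

    def solve(self, prompt: str) -> str:
        reasoning = self.cot_solver.solve(prompt + "\nLet's think step by step.")
        return self.extract_solver.solve(f"{prompt}\n{reasoning}\nFinal answer:")


class ToyHHHSolver:
    """Places alignment context in front of the prompt seen by the inner solver."""

    PREAMBLE = "The assistant is helpful, honest, and harmless.\n\n"  # placeholder text

    def __init__(self, solver):
        self.solver = solver

    def solve(self, prompt: str) -> str:
        return self.solver.solve(self.PREAMBLE + prompt)


class ToySelfConsistencySolver:
    """Samples the inner solver several times and returns the majority answer."""

    def __init__(self, solver, n_samples: int = 3):
        self.solver = solver
        self.n_samples = n_samples

    def solve(self, prompt: str) -> str:
        votes = Counter(self.solver.solve(prompt) for _ in range(self.n_samples))
        return votes.most_common(1)[0][0]


# Deterministic stand-ins for model calls (assumptions for illustration only).
def fake_reasoner(prompt: str) -> str:
    return "Sally believes the ball is still in the basket."


def fake_extractor(prompt: str) -> str:
    return "A" if "in the basket" in prompt else "B"


question = "Where will Sally look? (A) basket (B) box"

cot = ToyCoTSolver(ToySimpleSolver(fake_reasoner), ToySimpleSolver(fake_extractor))
cot_answer = cot.solve(question)

samples = cycle(["A", "B", "A"])  # simulated sampling noise
voter = ToySelfConsistencySolver(ToySimpleSolver(lambda _p: next(samples)), n_samples=3)
voted_answer = voter.solve(question)

hhh_prompt = ToyHHHSolver(ToySimpleSolver(lambda p: p)).solve(question)
```

Because every class exposes the same solve method, any solver can be used wherever another is expected, which is what allows the YAML to nest them freely.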

Each solver in the composition tree can have its own model-specific parameters, including temperature, maximum token count, and model identifier. This enables configurations like using GPT-4 for reasoning but a faster model for answer extraction, or using high temperature for self-consistency sampling but low temperature for final extraction.

A key pattern is systematic variation: a single YAML file can define solver variants for every combination of model and dataset. For example, a Theory of Mind benchmark configuration might define:

GPT-4 direct, GPT-4 CoT, GPT-4 HHH for each of three ToM task types.
GPT-3.5 direct, GPT-3.5 CoT, GPT-3.5 HHH for the same task types.
Base model (davinci) with HHH wrapping for each task type.

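This combinatorial coverage ensures that every intended model-strategy-dataset combination is evaluated (the base model appears only with HHH wrapping, since it needs that context to behave as an assistant). It also enables controlled comparisons where only one variable changes at a time.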
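
As a rough illustration of this coverage, the sketch below enumerates the registration names such a file would contain and finds the pairs that differ in exactly one variable. The task names are hypothetical placeholders (the three ToM task types are not named here), and gpt-3.5-turbo is assumed as the GPT-3.5 model identifier.

```python
from itertools import combinations

# Hypothetical placeholder task names; the real ToM task types are not listed on this page.
TASKS = ["tom_task_a", "tom_task_b", "tom_task_c"]

# Strategies defined per model, mirroring the list above (davinci only with HHH).
STRATEGIES_BY_MODEL = {
    "gpt-4": ["direct", "cot", "hhh"],
    "gpt-3.5-turbo": ["direct", "cot", "hhh"],
    "davinci": ["hhh"],
}


def solver_names(tasks, strategies_by_model):
    """Return every task/model/strategy registration name, in a stable order."""
    return [
        f"{task}/{model}/{strategy}"
        for task in tasks
        for model, strategies in strategies_by_model.items()
        for strategy in strategies
    ]


def single_variable_pairs(names):
    """Return pairs of names that differ in exactly one of task, model, strategy."""
    pairs = []
    for a, b in combinations(names, 2):
        differing = sum(x != y for x, y in zip(a.split("/"), b.split("/")))
        if differing == 1:
            pairs.append((a, b))
    return pairs


names = solver_names(TASKS, STRATEGIES_BY_MODEL)
pairs = single_variable_pairs(names)
```

Each pair returned by single_variable_pairs is a candidate controlled comparison, for example direct versus CoT on the same model and task.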

Another important pattern is the valid_answers constraint, which restricts the solver's output to a predefined set of acceptable answers. For multiple-choice tasks, this ensures the solver produces one of the valid option labels (e.g., "A", "B", "C", "D") rather than free-form text, improving evaluation reliability.
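
One way to picture the effect of such a constraint is a post-hoc check that maps raw model output onto the allowed labels. This is only a sketch of the idea under that assumption; the framework may enforce the constraint differently (for example at generation time), and constrain_to_valid is a hypothetical helper, not a framework function.

```python
from typing import Optional, Sequence

_PUNCTUATION = "\"'().:"


def constrain_to_valid(raw: str, valid_answers: Sequence[str]) -> Optional[str]:
    """Map raw model output to one of valid_answers, or None if nothing matches."""
    lookup = {answer.upper(): answer for answer in valid_answers}
    cleaned = raw.strip().strip(_PUNCTUATION).upper()
    if cleaned in lookup:
        return lookup[cleaned]
    # Fall back to the first whitespace-separated token, e.g. "A. The basket".
    first = cleaned.split()[0].strip(_PUNCTUATION) if cleaned else ""
    return lookup.get(first)


labels = ["A", "B", "C", "D"]
exact = constrain_to_valid(" b) ", labels)
prefixed = constrain_to_valid("A. The basket", labels)
free_form = constrain_to_valid("I think it is the basket", labels)
```

Returning None for unmatched output lets the evaluation count the sample as invalid instead of silently guessing a label.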

Usage

Apply solver configuration patterns in the following scenarios:

Systematic benchmarking across multiple models, reasoning strategies, and datasets.
Ablation studies that isolate the effect of a single component (e.g., does CoT help on this task?).
Reproducible evaluation where the exact configuration can be shared, versioned, and re-run.
Team collaboration where different researchers define solver variants without modifying shared code.

A representative configuration from the Theory of Mind solver config demonstrates the key patterns:

# Direct solver for GPT-4
theory_of_mind/gpt-4/direct:
  solver:
    class: evals.solvers.openai_solver:OpenAISolver
    args:
      model: gpt-4
      max_tokens: 512
      valid_answers:
        - "A"
        - "B"
      postprocessors:
        - evals.solvers.postprocessors:Strip

# CoT solver for GPT-4 with separate reasoning and extraction
theory_of_mind/gpt-4/cot:
  solver:
    class: evals.solvers.cot_solver:CoTSolver
    args:
      cot_solver:
        class: evals.solvers.openai_solver:OpenAISolver
        args:
          model: gpt-4
          max_tokens: 2048
          temperature: 0.7
      extract_solver:
        class: evals.solvers.openai_solver:OpenAISolver
        args:
          model: gpt-4
          max_tokens: 64
          temperature: 0.0
      valid_answers:
        - "A"
        - "B"

# HHH-wrapped solver for completion model
theory_of_mind/davinci/hhh:
  solver:
    class: evals.solvers.hhh_solver:HHHSolver
    args:
      solver:
        class: evals.solvers.openai_solver:OpenAISolver
        args:
          model: davinci
          max_tokens: 512
          temperature: 0.0

Naming convention: Solver configs follow the pattern eval_name/model/strategy, creating a natural hierarchy that maps directly to benchmark result tables.
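
To show how this naming hierarchy can map onto a result table, here is a small sketch that splits names into their three parts and pivots scores by model and strategy. The helper names are hypothetical, and the scores are made-up placeholders, not real results.

```python
from collections import defaultdict
from typing import Dict, NamedTuple


class SolverName(NamedTuple):
    eval_name: str
    model: str
    strategy: str


def parse_solver_name(name: str) -> SolverName:
    """Split an eval_name/model/strategy registration name into its parts."""
    parts = name.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected eval_name/model/strategy, got {name!r}")
    return SolverName(*parts)


def pivot(scores: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Arrange scores as table[model][strategy], one row per model."""
    table: Dict[str, Dict[str, float]] = defaultdict(dict)
    for name, score in scores.items():
        parsed = parse_solver_name(name)
        table[parsed.model][parsed.strategy] = score
    return dict(table)


# Placeholder scores for illustration only; these are not measured results.
placeholder_scores = {
    "theory_of_mind/gpt-4/direct": 0.0,
    "theory_of_mind/gpt-4/cot": 1.0,
    "theory_of_mind/davinci/hhh": 0.5,
}
table = pivot(placeholder_scores)
parsed = parse_solver_name("theory_of_mind/gpt-4/cot")
```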

Theoretical Basis

The theoretical foundation combines declarative configuration management with compositional design patterns. The YAML structure maps directly to a tree of solver objects, where each node in the tree is either a leaf (direct model call) or an internal node (compositional solver that delegates to children).

The configuration resolution algorithm proceeds as follows:

1. Parse the YAML configuration file:
   - Each top-level key is a solver registration name
   - Each value defines a solver tree (class + args, potentially nested)

2. For a given registration name, resolve the solver tree:
   a. Look up the top-level solver class named in the definition
   b. For each argument that is itself a solver definition:
      - Recursively resolve and instantiate the nested solver
      - Use the instantiated solver as the value of that constructor argument
   c. Pass the remaining arguments through unchanged, including model-specific parameters (temperature, max_tokens), postprocessors, and valid_answers constraints
   d. Instantiate the top-level solver class with the resolved arguments

3. Register the fully resolved solver under its name

4. At evaluation time:
   - Look up the solver by registration name
   - The evaluation framework interacts only with the top-level solver
   - Internal composition is transparent to the evaluation
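
The resolution steps above can be sketched as a short recursive function. This is a minimal illustration under stated assumptions, not the framework's loader: the configuration is given as an already-parsed dict, the toy:* class paths and the CLASSES lookup table are hypothetical (a real loader would import classes by their module path), and LeafSolver and TwoStageSolver are stand-in classes.

```python
from typing import Any, Dict


class LeafSolver:
    """Stand-in for a direct model call; only stores its parameters."""

    def __init__(self, model: str, temperature: float = 0.0, max_tokens: int = 256):
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens


class TwoStageSolver:
    """Stand-in for a compositional solver that delegates to two children."""

    def __init__(self, cot_solver: LeafSolver, extract_solver: LeafSolver):
        self.cot_solver = cot_solver
        self.extract_solver = extract_solver


# Hypothetical lookup table standing in for import-by-path.
CLASSES = {"toy:LeafSolver": LeafSolver, "toy:TwoStageSolver": TwoStageSolver}


def is_solver_spec(value: Any) -> bool:
    return isinstance(value, dict) and "class" in value


def resolve(spec: Dict[str, Any]) -> Any:
    """Build a solver from a class + args spec, resolving nested specs first."""
    cls = CLASSES[spec["class"]]
    kwargs = {
        key: resolve(value) if is_solver_spec(value) else value
        for key, value in spec.get("args", {}).items()
    }
    return cls(**kwargs)


# Parsed form of a YAML entry shaped like the CoT example earlier on this page.
CONFIG = {
    "theory_of_mind/gpt-4/cot": {
        "solver": {
            "class": "toy:TwoStageSolver",
            "args": {
                "cot_solver": {
                    "class": "toy:LeafSolver",
                    "args": {"model": "gpt-4", "max_tokens": 2048, "temperature": 0.7},
                },
                "extract_solver": {
                    "class": "toy:LeafSolver",
                    "args": {"model": "gpt-4", "max_tokens": 64, "temperature": 0.0},
                },
            },
        }
    }
}

# Steps 3 and 4: register each resolved solver under its name, then look it up.
REGISTRY = {name: resolve(entry["solver"]) for name, entry in CONFIG.items()}
solver = REGISTRY["theory_of_mind/gpt-4/cot"]
```

Nested specs are resolved before the parent is constructed, so each parent receives ready-made child solvers and the caller only ever touches the top-level object.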

This approach embodies the open-closed principle: the system is open for extension (new solver configurations can be added via YAML without code changes) but closed for modification (existing solver implementations do not need to change to support new configurations). As a result, the number of evaluated configurations can grow combinatorially while the codebase stays essentially fixed: each new combination adds YAML entries, not Python code.

Related Pages

Implementation:Openai_Evals_Theory_of_Mind_Solver_Config
