Implementation:Openai Evals Theory of Mind Solver Config

Knowledge Sources	Openai_Evals
Domains	Evaluation, Configuration
Last Updated	2026-02-14 10:00 GMT

Overview

The Theory of Mind solver configuration file defines reusable solver pipelines for three Theory of Mind benchmarks (ToMi, SocialIQA, and HiToM), covering four model variants and multiple reasoning strategies including simple completion, chain-of-thought, and self-consistency.

Description

theory_of_mind.yaml is a declarative YAML configuration file in the OpenAI Evals solver registry. Its 421 lines define solver configurations that are paired with the Theory of Mind evals at runtime. Unlike eval config files (which define what to evaluate), solver config files define how a model should approach answering questions -- including prompt wrappers, reasoning strategies, and answer constraints.

The file is organized into three sections, each corresponding to a different Theory of Mind dataset:

ToMi (Theory of Mind Inventory)

ToMi is an open-ended completion task where the expected answer is typically a single word. The file defines 8 solver configurations for ToMi:

Simple solvers for gpt-3.5-turbo, code-davinci-002, gpt-4, and gpt-4-base -- direct answer extraction with max_tokens: 10 and temperature: 0.
CoT (chain-of-thought) solvers for the same four models -- two-stage pipeline where a cot_solver generates reasoning (temperature: 1, max_tokens: 512) and an extract_solver produces the final answer (temperature: 0, max_tokens: 10).
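The two-stage CoT pipeline can be pictured as two composed functions. The following is a minimal sketch under stated assumptions, not the CoTSolver implementation: the name make_cot_pipeline, the prompt wording, and the stub stages are all hypothetical, and real stages would be model calls with the temperature and max_tokens values listed above.

```python
from typing import Callable

# In this sketch a "solver" is just a function from prompt text to completion text.
Solver = Callable[[str], str]


def make_cot_pipeline(cot_solver: Solver, extract_solver: Solver) -> Solver:
    """Compose a reasoning stage and an answer-extraction stage (hypothetical helper)."""

    def solve(question: str) -> str:
        # Stage 1: free-form reasoning (temperature 1, up to 512 tokens in the YAML).
        reasoning = cot_solver(f"{question}\nLet's think step by step.")
        # Stage 2: short deterministic answer (temperature 0, up to 10 tokens in the YAML).
        raw_answer = extract_solver(
            f"{question}\nReasoning: {reasoning}\nFinal answer:"
        )
        return raw_answer.strip()

    return solve


# Stub stages standing in for real model calls.
def fake_reasoner(prompt: str) -> str:
    return "Sally put the ball in the basket and did not see it being moved."


def fake_extractor(prompt: str) -> str:
    return " basket\n"


pipeline = make_cot_pipeline(fake_reasoner, fake_extractor)
answer = pipeline("Where will Sally look for the ball?")
```

Keeping the extraction stage separate lets the reasoning stage sample freely while the final answer stays short and deterministic.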

Completion models (code-davinci-002 and gpt-4-base) are wrapped in HHHSolver, which prepends a helpful-harmless-honest (HHH) prompt, since base models are not instruction-tuned and would otherwise simply continue the input text.
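The HHH wrapping amounts to decorating an inner solver with a fixed preamble. The sketch below illustrates the idea only: the preamble text, the Human/Assistant framing, and the name wrap_with_hhh are assumptions made for illustration and are not taken from HHHSolver.

```python
from typing import Callable

Solver = Callable[[str], str]

# Hypothetical preamble; the actual HHHSolver prompt text is not reproduced here.
HHH_PREAMBLE = (
    "Below is a conversation between a human and a helpful, harmless, "
    "and honest AI assistant.\n\n"
)


def wrap_with_hhh(solver: Solver) -> Solver:
    """Return a solver that prepends the preamble and frames the task as a dialogue turn."""

    def solve(task: str) -> str:
        return solver(f"{HHH_PREAMBLE}Human: {task}\n\nAssistant:")

    return solve


# Echo stub standing in for a base-model call, so the final prompt can be inspected.
def echo_model(prompt: str) -> str:
    return prompt


final_prompt = wrap_with_hhh(echo_model)("Which option is correct: A, B, or C?")
```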

SocialIQA

SocialIQA is a multiple-choice task with three options (A, B, C). The file defines 9 solver configurations:

Simple solvers for all four models -- constrained to valid_answers: ["A", "B", "C"] with max_tokens: 1 (or max_tokens: 2 for code-davinci-002 due to a tokenizer issue).
CoT solvers for all four models -- two-stage pipeline as above, with the extract stage constrained to valid answers.
SelfConsistencySolver for gpt-4 -- samples the inner solver multiple times with temperature: 1 and max_tokens: 128, then aggregates the samples in "judge" mode, where a model picks the final answer.
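To make the self-consistency idea concrete, here is a small sketch that aggregates repeated samples by majority vote. This is a deliberately simpler stand-in for the "judge" mode used in the YAML, not a description of SelfConsistencySolver; the function names and the canned samples are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(samples: Sequence[str], valid_answers: Sequence[str]) -> str:
    """Return the most frequent valid answer; ties go to the answer seen first."""
    valid = [s.strip() for s in samples if s.strip() in valid_answers]
    if not valid:
        raise ValueError("no sample matched a valid answer")
    return Counter(valid).most_common(1)[0][0]


def self_consistent_answer(
    sample_fn: Callable[[], str], n: int, valid_answers: Sequence[str]
) -> str:
    """Draw n samples from sample_fn and aggregate them by majority vote."""
    samples = [sample_fn() for _ in range(n)]
    return majority_vote(samples, valid_answers)


# Canned samples standing in for temperature-1 model calls.
canned = iter(["B", "A", "B ", "C", "B"])
result = self_consistent_answer(lambda: next(canned), n=5, valid_answers=["A", "B", "C"])
```

Majority voting needs no extra model call, whereas a judge-style aggregator spends one more call to weigh the sampled reasoning.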

HiToM (Higher-order Theory of Mind)

HiToM is a multiple-choice task with up to 15 options (A through O). The file defines 8 solver configurations:

Simple solvers for all four models -- constrained to valid_answers: ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"].
CoT solvers for all four models -- two-stage pipeline with valid answer constraints on the extract stage.
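The answer lists for the two multiple-choice tasks differ only in length, so they can be generated rather than typed out. A small sketch, with answer_letters as a hypothetical helper name:

```python
import string


def answer_letters(n_options: int) -> list[str]:
    """Return the first n_options uppercase letters as answer labels."""
    if not 1 <= n_options <= 26:
        raise ValueError("n_options must be between 1 and 26")
    return list(string.ascii_uppercase[:n_options])


socialiqa_answers = answer_letters(3)   # three options, as in the SocialIQA solvers
hitom_answers = answer_letters(15)      # fifteen options, as in the HiToM solvers
```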

The final entry (hitom/cot_solver/gpt-4-base) uses a slightly different structure with cot_options and extract_options keys directly instead of nested solver objects, and uses max_tokens: 64 for the CoT reasoning stage rather than 512.

Solver Classes

Four solver classes are referenced throughout the file:

evals.solvers.providers.openai.openai_solver:OpenAISolver -- base solver that calls the OpenAI API with specified model and parameters.
evals.solvers.nested.cot_solver:CoTSolver -- wraps two solvers: one generates chain-of-thought reasoning, the other extracts a final answer.
evals.solvers.nested.hhh_solver:HHHSolver -- wraps a solver with a helpful-harmless-honest system prompt; used for base/completion models.
evals.solvers.nested.self_consistency_solver:SelfConsistencySolver -- runs a solver multiple times and aggregates results for consistency.
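Each class value above follows a module.path:ClassName pattern. One plausible way to turn such a string into a Python class is shown below; this is a sketch using only the standard library, not the registry's own loading code, and resolve_class is a hypothetical name. It is demonstrated with a standard-library class so that it runs without the evals package installed.

```python
import importlib


def resolve_class(spec: str) -> type:
    """Resolve a 'module.path:ClassName' string to the class object it names."""
    module_path, sep, class_name = spec.partition(":")
    if not sep or not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {spec!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# With evals installed, the same call would accept a string such as
# "evals.solvers.nested.cot_solver:CoTSolver".
ordered_dict_cls = resolve_class("collections:OrderedDict")
```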

Usage

Use this configuration when running any of the three Theory of Mind eval benchmarks. A solver is selected by its YAML key (e.g., tomi/cot_solver/gpt-4), which is passed to oaieval together with the eval name. Solver configs decouple the evaluation task definition from the model and reasoning strategy, so the same eval can be run with different solvers without modifying the eval spec.

Code Reference

Source Location

Repository: Openai_Evals
File: evals/registry/solvers/theory_of_mind.yaml
Lines: 1-421

Configuration Schema

ToMi simple solver (chat model):

tomi/simple_solver/gpt-3.5-turbo:
  class: evals.solvers.providers.openai.openai_solver:OpenAISolver
  args:
    completion_fn_options:
      model: gpt-3.5-turbo
      extra_options:
        temperature: 0
        max_tokens: 10

ToMi CoT solver (chat model):

tomi/cot_solver/gpt-3.5-turbo:
  class: evals.solvers.nested.cot_solver:CoTSolver
  args:
    cot_solver:
      class: evals.solvers.providers.openai.openai_solver:OpenAISolver
      args:
        completion_fn_options:
          model: gpt-3.5-turbo
          extra_options:
            temperature: 1
            max_tokens: 512
    extract_solver:
      class: evals.solvers.providers.openai.openai_solver:OpenAISolver
      args:
        completion_fn_options:
          model: gpt-3.5-turbo
          extra_options:
            temperature: 0
            max_tokens: 10

SocialIQA simple solver with HHH wrapper (completion model):

socialiqa/simple_solver/code-davinci-002:
  class: evals.solvers.nested.hhh_solver:HHHSolver
  args:
    solver:
      class: evals.solvers.providers.openai.openai_solver:OpenAISolver
      args:
        completion_fn_options:
          model: code-davinci-002
          extra_options:
            temperature: 0
            max_tokens: 2
        valid_answers: ["A", "B", "C"]

SocialIQA SelfConsistency solver:

socialiqa/selfconsistency/gpt-4:
  class: evals.solvers.nested.self_consistency_solver:SelfConsistencySolver
  args:
    solver:
      class: evals.solvers.providers.openai.openai_solver:OpenAISolver
      args:
        completion_fn_options:
          model: gpt-4
          extra_options:
            temperature: 1
            max_tokens: 128
    mode: "judge"

HiToM simple solver with valid answer constraint:

hitom/simple_solver/gpt-3.5-turbo:
  class: evals.solvers.providers.openai.openai_solver:OpenAISolver
  args:
    completion_fn_options:
      model: gpt-3.5-turbo
      extra_options:
        temperature: 0
        max_tokens: 1
    valid_answers: ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"]

I/O Contract

Inputs

Name	Type	Required	Description
class	string	Yes	Fully-qualified Python class for the solver (see Solver Classes section above)
args.completion_fn_options.model	string	Yes	OpenAI model identifier (e.g., gpt-3.5-turbo, gpt-4, code-davinci-002, gpt-4-base); applies to OpenAISolver entries, including nested ones
args.completion_fn_options.extra_options.temperature	float	Yes	Sampling temperature; 0 for deterministic extraction, 1 for diverse reasoning; applies to OpenAISolver entries, including nested ones
args.completion_fn_options.extra_options.max_tokens	int	Yes	Maximum tokens to generate; 1-2 for single-letter answers, 10 for short answers, 64-512 for chain-of-thought reasoning; applies to OpenAISolver entries, including nested ones
args.valid_answers	list[string]	No	Constrains the model output to one of the provided answer options (used for multiple-choice tasks)
args.cot_solver	object	No	Nested solver spec for the chain-of-thought reasoning stage (CoTSolver only)
args.extract_solver	object	No	Nested solver spec for the answer extraction stage (CoTSolver only)
args.solver	object	No	Nested solver spec wrapped by HHHSolver or SelfConsistencySolver
args.mode	string	No	Aggregation mode for SelfConsistencySolver; "judge" uses a model to pick the best answer
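Because wrapper solvers nest other solver specs under solver, cot_solver, or extract_solver, the model settings of one entry can sit at several depths. The sketch below walks a spec that has already been loaded into a Python dict and lists every model call it contains. The function name iter_model_calls is hypothetical, and the traversal assumes only the keys listed in the table above.

```python
from typing import Any, Iterator

# Keys under which a wrapper solver nests another solver spec (per the table above).
NESTED_KEYS = ("solver", "cot_solver", "extract_solver")


def iter_model_calls(
    spec: dict[str, Any], path: str = ""
) -> Iterator[tuple[str, str, dict[str, Any]]]:
    """Yield (path, model, extra_options) for every model call inside a solver spec."""
    args = spec.get("args", {})
    options = args.get("completion_fn_options")
    if options is not None:
        yield path or "<root>", options["model"], options.get("extra_options", {})
    for key in NESTED_KEYS:
        if key in args:
            child_path = f"{path}.{key}" if path else key
            yield from iter_model_calls(args[key], child_path)


def openai_spec(model: str, temperature: float, max_tokens: int) -> dict[str, Any]:
    """Build a dict shaped like an OpenAISolver entry."""
    return {
        "class": "evals.solvers.providers.openai.openai_solver:OpenAISolver",
        "args": {
            "completion_fn_options": {
                "model": model,
                "extra_options": {"temperature": temperature, "max_tokens": max_tokens},
            }
        },
    }


# Mirrors the tomi/cot_solver/gpt-3.5-turbo entry shown under Configuration Schema.
cot_spec = {
    "class": "evals.solvers.nested.cot_solver:CoTSolver",
    "args": {
        "cot_solver": openai_spec("gpt-3.5-turbo", 1, 512),
        "extract_solver": openai_spec("gpt-3.5-turbo", 0, 10),
    },
}

calls = list(iter_model_calls(cot_spec))
```

A listing like this is a quick way to check that every extraction stage runs at temperature 0 before starting a long evaluation.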

Outputs

Name	Type	Description
solver_result	string	The final answer produced by the solver pipeline (a word for ToMi, a letter for SocialIQA and HiToM)

Usage Examples

Running ToMi with a Simple Solver

oaieval tomi/simple_solver/gpt-3.5-turbo tomi

Running SocialIQA with Chain-of-Thought

oaieval socialiqa/cot_solver/gpt-4 socialiqa

Running HiToM with a Base Model

oaieval hitom/simple_solver/gpt-4-base hitom

Running SocialIQA with Self-Consistency

oaieval socialiqa/selfconsistency/gpt-4 socialiqa
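The commands above follow a regular pattern, so a sweep over every solver can be generated programmatically. The sketch below assumes that all solver IDs use the dataset/strategy/model naming seen in the examples on this page and that the eval name equals the dataset prefix, as in the commands above; both are assumptions rather than facts checked against the registry. It builds argument lists only and does not execute anything.

```python
from itertools import product

DATASETS = ("tomi", "socialiqa", "hitom")
STRATEGIES = ("simple_solver", "cot_solver")
MODELS = ("gpt-3.5-turbo", "code-davinci-002", "gpt-4", "gpt-4-base")


def solver_ids() -> list[str]:
    """Enumerate solver IDs, assuming uniform dataset/strategy/model naming."""
    ids = [f"{d}/{s}/{m}" for d, s, m in product(DATASETS, STRATEGIES, MODELS)]
    # The single self-consistency entry described for SocialIQA.
    ids.append("socialiqa/selfconsistency/gpt-4")
    return ids


def oaieval_command(solver_id: str) -> list[str]:
    """Build an argument list in the same shape as the usage examples above."""
    dataset = solver_id.split("/", 1)[0]
    return ["oaieval", solver_id, dataset]


all_ids = solver_ids()
commands = [oaieval_command(solver_id) for solver_id in all_ids]
```

The enumeration yields 8 ToMi, 9 SocialIQA, and 8 HiToM IDs, matching the per-dataset counts given in the Description. Each argument list could be handed to subprocess.run once the IDs have been confirmed against the YAML file.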

Related Pages

Openai_Evals_Solver_Base_Class -- defines the base Solver interface that all solver classes inherit from
Openai_Evals_Eval_YAML_Registration -- describes the registry mechanism that loads YAML configuration files
Openai_Evals_Oaieval_Run -- the CLI entrypoint that wires solvers to eval specs at runtime
