Implementation:Openai Evals Theory of Mind Solver Config
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Configuration |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
The Theory of Mind solver configuration file defines reusable solver pipelines for three Theory of Mind benchmarks (ToMi, SocialIQA, and HiToM), covering four model variants and multiple reasoning strategies including simple completion, chain-of-thought, and self-consistency.
Description
theory_of_mind.yaml is a declarative YAML configuration file in the OpenAI Evals solver registry. It contains 421 lines defining solver configurations that are referenced by Theory of Mind eval specs at runtime. Unlike eval config files (which define what to evaluate), solver config files define how a model should approach answering questions -- including prompt wrappers, reasoning strategies, and answer constraints.
The file is organized into three sections, each corresponding to a different Theory of Mind dataset:
ToMi (Theory of Mind Inventory)
ToMi is an open-ended completion task where the expected answer is typically a single word. The file defines 8 solver configurations for ToMi:
- Simple solvers for gpt-3.5-turbo, code-davinci-002, gpt-4, and gpt-4-base -- direct answer extraction with max_tokens: 10 and temperature: 0.
- CoT (chain-of-thought) solvers for the same four models -- two-stage pipeline where a cot_solver generates reasoning (temperature: 1, max_tokens: 512) and an extract_solver produces the final answer (temperature: 0, max_tokens: 10).
Completion models (code-davinci-002 and gpt-4-base) are wrapped in HHHSolver, which prepends a helpful-harmless-honest (HHH) prompt, since base models are not tuned to follow instructions.
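The two-stage CoT pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the CoTSolver implementation: the `ModelFn` type, the `fake_model` stand-in, and both prompt strings are assumptions made for the example. Only the temperature and max_tokens values come from the configuration.

```python
from typing import Callable

# Hypothetical stand-in for a model call: (prompt, temperature, max_tokens) -> text.
ModelFn = Callable[[str, float, int], str]


def two_stage_cot(question: str, model: ModelFn) -> str:
    """Run a sampled reasoning pass, then a short deterministic extraction pass."""
    # Stage 1: free-form reasoning with the CoT settings (temperature 1, 512 tokens).
    reasoning = model(f"{question}\nLet's think step by step.", 1.0, 512)
    # Stage 2: answer extraction with the extract settings (temperature 0, 10 tokens).
    extract_prompt = f"{question}\n{reasoning}\nGiven the reasoning above, the answer is:"
    return model(extract_prompt, 0.0, 10).strip()


def fake_model(prompt: str, temperature: float, max_tokens: int) -> str:
    # Toy model so the sketch runs offline; a real solver would call an API here.
    if max_tokens > 10:
        return "Sally last saw the ball in the basket."
    return " basket "


answer = two_stage_cot("Where will Sally look for the ball?", fake_model)
print(answer)
```

Keeping the extraction pass at temperature 0 with a small token budget means the final answer stays short and repeatable even though the reasoning pass is sampled.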
SocialIQA
SocialIQA is a multiple-choice task with three options (A, B, C). The file defines 9 solver configurations:
- Simple solvers for all four models -- constrained to valid_answers: ["A", "B", "C"] with max_tokens: 1 (or max_tokens: 2 for code-davinci-002 due to a tokenizer issue).
- CoT solvers for all four models -- two-stage pipeline as above, with the extract stage constrained to valid answers.
- SelfConsistencySolver for gpt-4 -- runs the solver multiple times with temperature: 1 and max_tokens: 128, then uses a "judge" mode to select the most consistent answer.
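As a rough illustration of the self-consistency idea, the sketch below samples a solver several times and aggregates the answers. Note the swap: it uses a plain majority vote as the aggregation step, whereas the configuration selects "judge" mode, which delegates the choice to a model. The helper names `sample_answers` and `majority_vote` are hypothetical and not part of the evals library.

```python
from collections import Counter
from typing import Callable, List


def sample_answers(solve: Callable[[], str], n: int) -> List[str]:
    """Call the (stochastic) solver n times and collect its answers."""
    return [solve() for _ in range(n)]


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer to aggregate")
    return Counter(answers).most_common(1)[0][0]


# Deterministic stand-in for a solver sampled at temperature 1.
canned = iter(["B", "A", "B", "C", "B"])
samples = sample_answers(lambda: next(canned), 5)
winner = majority_vote(samples)
print(samples, winner)
```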
HiToM (Higher-order Theory of Mind)
HiToM is a multiple-choice task with up to 15 options (A through O). The file defines 8 solver configurations:
- Simple solvers for all four models -- constrained to valid_answers: ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"].
- CoT solvers for all four models -- two-stage pipeline with valid answer constraints on the extract stage.
The final entry (hitom/cot_solver/gpt-4-base) uses a slightly different structure with cot_options and extract_options keys directly instead of nested solver objects, and uses max_tokens: 64 for the CoT reasoning stage rather than 512.
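The 15-letter valid_answers list used by the HiToM solvers, and the 3-letter list used for SocialIQA, follow the same pattern. A small sketch of how such a list could be generated and checked is shown below; `option_letters` and `is_valid_answer` are hypothetical helpers written for this page, not functions from the evals library.

```python
import string
from typing import List


def option_letters(n: int) -> List[str]:
    """Return the first n uppercase letters, e.g. n=3 gives A, B, C."""
    if not 1 <= n <= 26:
        raise ValueError("n must be between 1 and 26")
    return list(string.ascii_uppercase[:n])


def is_valid_answer(answer: str, valid_answers: List[str]) -> bool:
    """Check a model output against the allowed options, ignoring whitespace."""
    return answer.strip() in valid_answers


hitom_options = option_letters(15)
print(hitom_options)
print(is_valid_answer(" O", hitom_options), is_valid_answer("P", hitom_options))
```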
Solver Classes
Four solver classes are referenced throughout the file:
- evals.solvers.providers.openai.openai_solver:OpenAISolver -- base solver that calls the OpenAI API with specified model and parameters.
- evals.solvers.nested.cot_solver:CoTSolver -- wraps two solvers: one generates chain-of-thought reasoning, the other extracts a final answer.
- evals.solvers.nested.hhh_solver:HHHSolver -- wraps a solver with a helpful-harmless-honest system prompt; used for base/completion models.
- evals.solvers.nested.self_consistency_solver:SelfConsistencySolver -- runs a solver multiple times and aggregates results for consistency.
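Each class entry uses a `module.path:ClassName` string. The sketch below shows one common way such a string can be resolved to a Python class with the standard library. It is an assumption that the evals registry does something similar. `resolve_class` is a hypothetical helper, and the demo resolves a standard-library class so that it runs without evals installed.

```python
import importlib


def resolve_class(spec: str) -> type:
    """Resolve a 'package.module:ClassName' string to the class object."""
    module_name, _, class_name = spec.partition(":")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in target; with evals installed, a solver class path from the list
# above could be passed instead.
cls = resolve_class("collections:OrderedDict")
print(cls.__name__)
```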
Usage
Use this configuration when running any of the three Theory of Mind eval benchmarks. The eval spec references solver IDs from this file by their YAML key (e.g., tomi/cot_solver/gpt-4). Solver configs decouple the evaluation task definition from the model and reasoning strategy, enabling the same eval to be run with different solver configurations without modifying the eval spec.
Code Reference
Source Location
- Repository: Openai_Evals
- File: evals/registry/solvers/theory_of_mind.yaml
- Lines: 1-421
Configuration Schema
ToMi simple solver (chat model):
tomi/simple_solver/gpt-3.5-turbo:
  class: evals.solvers.providers.openai.openai_solver:OpenAISolver
  args:
    completion_fn_options:
      model: gpt-3.5-turbo
      extra_options:
        temperature: 0
        max_tokens: 10
ToMi CoT solver (chat model):
tomi/cot_solver/gpt-3.5-turbo:
  class: evals.solvers.nested.cot_solver:CoTSolver
  args:
    cot_solver:
      class: evals.solvers.providers.openai.openai_solver:OpenAISolver
      args:
        completion_fn_options:
          model: gpt-3.5-turbo
          extra_options:
            temperature: 1
            max_tokens: 512
    extract_solver:
      class: evals.solvers.providers.openai.openai_solver:OpenAISolver
      args:
        completion_fn_options:
          model: gpt-3.5-turbo
          extra_options:
            temperature: 0
            max_tokens: 10
SocialIQA simple solver with HHH wrapper (completion model):
socialiqa/simple_solver/code-davinci-002:
  class: evals.solvers.nested.hhh_solver:HHHSolver
  args:
    solver:
      class: evals.solvers.providers.openai.openai_solver:OpenAISolver
      args:
        completion_fn_options:
          model: code-davinci-002
          extra_options:
            temperature: 0
            max_tokens: 2
        valid_answers: ["A", "B", "C"]
SocialIQA SelfConsistency solver:
socialiqa/selfconsistency/gpt-4:
  class: evals.solvers.nested.self_consistency_solver:SelfConsistencySolver
  args:
    solver:
      class: evals.solvers.providers.openai.openai_solver:OpenAISolver
      args:
        completion_fn_options:
          model: gpt-4
          extra_options:
            temperature: 1
            max_tokens: 128
    mode: "judge"
HiToM simple solver with valid answer constraint:
hitom/simple_solver/gpt-3.5-turbo:
  class: evals.solvers.providers.openai.openai_solver:OpenAISolver
  args:
    completion_fn_options:
      model: gpt-3.5-turbo
      extra_options:
        temperature: 0
        max_tokens: 1
    valid_answers: ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"]
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| class | string | Yes | Fully-qualified Python class for the solver (see Solver Classes section above) |
| args.completion_fn_options.model | string | Yes | OpenAI model identifier (e.g., gpt-3.5-turbo, gpt-4, code-davinci-002, gpt-4-base) |
| args.completion_fn_options.extra_options.temperature | float | Yes | Sampling temperature; 0 for deterministic extraction, 1 for diverse reasoning |
| args.completion_fn_options.extra_options.max_tokens | int | Yes | Maximum tokens to generate; 1-2 for single-letter answers, 10 for short answers, 64-512 for chain-of-thought reasoning |
| args.valid_answers | list[string] | No | Constrains the model output to one of the provided answer options (used for multiple-choice tasks) |
| args.cot_solver | object | No | Nested solver spec for the chain-of-thought reasoning stage (CoTSolver only) |
| args.extract_solver | object | No | Nested solver spec for the answer extraction stage (CoTSolver only) |
| args.solver | object | No | Nested solver spec wrapped by HHHSolver or SelfConsistencySolver |
| args.mode | string | No | Aggregation mode for SelfConsistencySolver; "judge" uses a model to pick the best answer |
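To make the nesting in the table concrete, the sketch below walks a solver spec (written as a plain Python dict that mirrors the ToMi CoT entry) and lists the model, temperature, and max_tokens of every leaf solver. The `leaf_settings` helper and the `NESTED_KEYS` tuple are hypothetical, written only for this illustration. In practice the dict would come from parsing the YAML file.

```python
from typing import Any, Dict, Iterator, Tuple

# Keys from the inputs table that hold a nested solver spec.
NESTED_KEYS = ("solver", "cot_solver", "extract_solver")


def leaf_settings(
    spec: Dict[str, Any], path: str = ""
) -> Iterator[Tuple[str, str, float, int]]:
    """Yield (path, model, temperature, max_tokens) for each leaf solver."""
    args = spec.get("args", {})
    options = args.get("completion_fn_options")
    if options is not None:
        extra = options["extra_options"]
        yield (path or "root", options["model"], extra["temperature"], extra["max_tokens"])
    for key in NESTED_KEYS:
        if key in args:
            child_path = f"{path}.{key}" if path else key
            yield from leaf_settings(args[key], child_path)


def openai_spec(model: str, temperature: float, max_tokens: int) -> Dict[str, Any]:
    # Builds the same shape as an OpenAISolver entry in the configuration.
    return {
        "class": "evals.solvers.providers.openai.openai_solver:OpenAISolver",
        "args": {
            "completion_fn_options": {
                "model": model,
                "extra_options": {"temperature": temperature, "max_tokens": max_tokens},
            }
        },
    }


cot_spec = {
    "class": "evals.solvers.nested.cot_solver:CoTSolver",
    "args": {
        "cot_solver": openai_spec("gpt-3.5-turbo", 1, 512),
        "extract_solver": openai_spec("gpt-3.5-turbo", 0, 10),
    },
}

settings = list(leaf_settings(cot_spec))
for row in settings:
    print(row)
```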
Outputs
| Name | Type | Description |
|---|---|---|
| solver_result | string | The final answer produced by the solver pipeline (a word for ToMi, a letter for SocialIQA and HiToM) |
Usage Examples
Running ToMi with a Simple Solver
oaieval tomi/simple_solver/gpt-3.5-turbo tomi
Running SocialIQA with Chain-of-Thought
oaieval socialiqa/cot_solver/gpt-4 socialiqa
Running HiToM with a Base Model
oaieval hitom/simple_solver/gpt-4-base hitom
Running SocialIQA with Self-Consistency
oaieval socialiqa/selfconsistency/gpt-4 socialiqa
Related Pages
- Openai_Evals_Solver_Base_Class -- defines the base Solver interface that all solver classes inherit from
- Openai_Evals_Eval_YAML_Registration -- describes the registry mechanism that loads YAML configuration files
- Openai_Evals_Oaieval_Run -- the CLI entrypoint that wires solvers to eval specs at runtime