Implementation:Openai Evals SolverToolsConvo Runner

Knowledge Sources	Openai_Evals
Domains	Evaluation, Tool Use
Last Updated	2026-02-14 10:00 GMT

Overview

Concrete runner for orchestrating multi-turn solver-tool conversation loops, provided by the evals library.

Description

This module implements a multi-turn conversation loop for evaluating solvers that interact with tools. It defines three dataclasses and the Runner orchestrator class.

ToolCall holds a single tool invocation's tool_name, input string, and output (populated after execution). ParsedSolverResult contains a list of tool_calls and an optional final_answer extracted from the solver's response. RunnerResult bundles the final_task_state, final_solver_result, and computed metrics (correctness and turn count).

The Runner class drives the evaluation loop. On initialization it receives a solver, a sample dict (with "task" and "answer" keys), a name_to_tool mapping, a max_turns limit, and default message templates. The run method operates as follows:

It builds the initial TaskState with tool descriptions formatted into the task description and the sample's task as the first user message.
On each turn it calls the solver, then _parse_solver_result extracts tool calls and/or a final answer from the output.
Tool calls are found by _find_tool_messages, which uses a regex to match text of the form (@ToolName: input), excluding the reserved names "Answer" and "Bugged".
A final answer is extracted by _parse_final_answer, which matches text of the form (@Answer: output).
If neither a tool call nor a final answer is found, the reminder message is appended and the loop continues with the next turn.
If a final answer is present, _finish_run compares it (case-insensitive, stripped) against the sample's expected answer and returns the result.
Otherwise, each tool call is executed via _run_tool_call, which creates a ToolTaskState for the tool, invokes it, and captures the output. Tool outputs are formatted and appended as a user message.
The loop repeats until a final answer is given or max_turns is reached.
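
The steps above can be sketched without the evals library. This is a minimal illustration of the control flow, not the module's code: the two regexes are assumptions inferred from the (@Name: input) shape described here, and run_loop, its plain-function solver and tools, and the "output:" formatting are hypothetical stand-ins for Solver, Tool, and TaskState.

```python
import re
from typing import Callable, Optional

# Assumed patterns: this page gives only the (@Name: input) shape, not the regexes.
TOOL_CALL_RE = re.compile(r"\(@(?!Answer\b|Bugged\b)(\w+): (.+?)\)", re.DOTALL)
FINAL_ANSWER_RE = re.compile(r"\(@Answer: (.*?)\)", re.DOTALL)


def find_tool_messages(text: str) -> list[tuple[str, str]]:
    """Return (tool_name, input) pairs, skipping the reserved names."""
    return TOOL_CALL_RE.findall(text)


def parse_final_answer(text: str) -> Optional[str]:
    """Return the text inside (@Answer: ...), or None if absent."""
    match = FINAL_ANSWER_RE.search(text)
    return match.group(1) if match else None


def run_loop(
    solver: Callable[[list[str]], str],      # hypothetical: message history -> raw output
    tools: dict[str, Callable[[str], str]],  # hypothetical: tool name -> plain function
    task: str,
    max_turns: int,
    reminder: str,
) -> tuple[Optional[str], int]:
    """Return (final_answer, turns_used); final_answer is None if the limit is hit."""
    messages = [task]
    for turn in range(max_turns):
        output = solver(messages)
        messages.append(output)
        final_answer = parse_final_answer(output)
        tool_calls = find_tool_messages(output)

        if final_answer is None and not tool_calls:
            messages.append(reminder)  # nothing usable: remind and try again
            continue
        if final_answer is not None:
            return final_answer, turn + 1

        # Execute every tool call and feed the outputs back as one message.
        # Unknown tool names raise KeyError here; the real runner may differ.
        outputs = [f"{name} output: {tools[name](arg)}" for name, arg in tool_calls]
        messages.append("\n".join(outputs))
    return None, max_turns


# Scripted stand-in solver: calls the tool first, then answers.
replies = iter(["(@Calculator: 15 * 23)", "The result is (@Answer: 345)"])
answer, turns = run_loop(
    solver=lambda history: next(replies),
    tools={"Calculator": lambda expr: "345"},
    task="What is 15 * 23?",
    max_turns=5,
    reminder="Please use a tool or provide your final answer.",
)
print(answer, turns)
```

In this sketch a final answer takes priority over any tool calls in the same output, mirroring the order of the checks described above.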

Usage

Import Runner when building a tool-use evaluation that requires a multi-turn conversation between a solver and a set of tools. It is used by eval suites such as bugged_tools, where the solver must call external tools to complete a task and eventually emit a final answer.

Code Reference

Source Location

Repository: Openai_Evals
File: evals/elsuite/solver_tools_convo.py
Lines: 1-240

Signature

@dataclass
class ToolCall:
    tool_name: str
    input: str
    output: Any

@dataclass
class ParsedSolverResult:
    tool_calls: list[ToolCall]
    final_answer: Optional[str]

@dataclass
class RunnerResult:
    final_task_state: ToolTaskState
    final_solver_result: SolverResult
    metrics: dict

class Runner:
    def __init__(
        self,
        solver: Solver,
        sample: Any,
        name_to_tool: dict,
        max_turns: int,
        default_task_description: str,
        default_reminder_message: str,
    ):
        ...

    def run(self) -> RunnerResult:
        ...

    def _parse_solver_result(self, solver_result: SolverResult) -> ParsedSolverResult:
        ...

    def _parse_tool_calls(self, output: str) -> Optional[list[ToolCall]]:
        ...

    def _find_tool_messages(self, text: str) -> list[tuple[str, str]]:
        ...

    def _parse_final_answer(self, output: str) -> Optional[str]:
        ...

    def _run_tool_call(self, tool_call: ToolCall) -> ToolCall:
        ...

    def _finish_run(
        self,
        final_task_state: TaskState,
        solver_result: SolverResult,
        final_answer: Optional[str],
        turn: int,
    ) -> RunnerResult:
        ...

Import

from evals.elsuite.solver_tools_convo import Runner

I/O Contract

Inputs

Runner.__init__

Name	Type	Required	Description
solver	Solver	Yes	The solver instance that generates responses each turn
sample	Any (dict)	Yes	A sample dict containing "task" (the problem statement) and "answer" (the expected answer)
name_to_tool	dict[str, Tool]	Yes	Mapping from tool name strings to Tool instances available to the solver
max_turns	int	Yes	Maximum number of conversation turns before the run is terminated
default_task_description	str	Yes	Template for the system-level task description; must contain {tool_names_and_descriptions} placeholder
default_reminder_message	str	Yes	Message sent to the solver when it produces neither tool calls nor a final answer
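
As a rough illustration of the {tool_names_and_descriptions} placeholder, the sketch below fills a template with str.format. FakeTool and build_task_description are hypothetical names, and the "- name: description" layout is an assumption; this page does not state how the library formats the tool list.

```python
from dataclasses import dataclass


@dataclass
class FakeTool:
    """Hypothetical stand-in for a Tool: just a name and a description."""
    name: str
    description: str


def build_task_description(template: str, tools: list[FakeTool]) -> str:
    # Assumed layout: one "- name: description" line per tool.
    listing = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    return template.format(tool_names_and_descriptions=listing)


template = (
    "You have access to the following tools:\n{tool_names_and_descriptions}\n"
    "To use a tool, write (@ToolName: input)."
)
description = build_task_description(
    template, [FakeTool("Calculator", "Evaluates arithmetic.")]
)
print(description)
```

Note that str.format does not complain when the placeholder is missing: the template is returned without the tool list. Any literal braces in a template must be doubled ({{ and }}) to survive formatting.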

Runner.run

Name	Type	Required	Description
(none)	--	--	Uses instance attributes set during __init__; no additional arguments

Outputs

RunnerResult (returned by run)

Name	Type	Description
final_task_state	ToolTaskState	The complete conversation state at the end of the run, including all messages exchanged
final_solver_result	SolverResult	The last output produced by the solver before the run terminated
metrics	dict	Dictionary containing is_correct (bool: whether the final answer matched the expected answer) and num_turns (int: total number of turns used)
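
The metrics dictionary can be pictured with the sketch below, which applies the case-insensitive, whitespace-stripped comparison described for _finish_run. compute_metrics is a hypothetical helper; treating a missing final answer as incorrect is an assumption, and the library may apply further normalization or count turns differently.

```python
from typing import Optional


def compute_metrics(
    final_answer: Optional[str], expected_answer: str, num_turns: int
) -> dict:
    # Assumption: a run that ends without a final answer counts as incorrect.
    is_correct = (
        final_answer is not None
        and final_answer.strip().lower() == expected_answer.strip().lower()
    )
    return {"is_correct": is_correct, "num_turns": num_turns}


print(compute_metrics("  Paris ", "paris", 3))
print(compute_metrics(None, "paris", 5))
```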

ParsedSolverResult (internal)

Name	Type	Description
tool_calls	list[ToolCall]	List of parsed tool calls found in the solver output (may be empty)
final_answer	Optional[str]	The final answer extracted from an (@Answer: ...) pattern, or None

Usage Examples

from evals.elsuite.solver_tools_convo import Runner, RunnerResult
from evals.solvers.solver import Solver
from evals.elsuite.bugged_tools.tools import Tool

# Assuming `my_solver` is a configured Solver instance
# and `calculator_tool` is a Tool instance:
sample = {
    "task": "What is 15 * 23? Use the calculator tool to find out.",
    "answer": "345",
}

runner = Runner(
    solver=my_solver,
    sample=sample,
    name_to_tool={"Calculator": calculator_tool},
    max_turns=5,
    default_task_description=(
        "You have access to the following tools:\n{tool_names_and_descriptions}\n"
        "To use a tool, write (@ToolName: input).\n"
        "To give your final answer, write (@Answer: your_answer)."
    ),
    default_reminder_message="Please use a tool or provide your final answer.",
)

result: RunnerResult = runner.run()
print(f"Correct: {result.metrics['is_correct']}")
print(f"Turns used: {result.metrics['num_turns']}")
