Implementation:Openai Evals SolverToolsConvo Runner
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Tool Use |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Concrete runner for orchestrating multi-turn solver-tool conversation loops, provided by the evals library.
Description
This module implements a multi-turn conversation loop for evaluating solvers that interact with tools. It defines three dataclasses and the Runner orchestrator class.
ToolCall holds a single tool invocation's tool_name, input string, and output (populated after execution). ParsedSolverResult contains a list of tool_calls and an optional final_answer extracted from the solver's response. RunnerResult bundles the final_task_state, final_solver_result, and computed metrics (correctness and turn count).
The Runner class drives the evaluation loop. On initialization it receives a solver, a sample dict (with "task" and "answer" keys), a name_to_tool mapping, a max_turns limit, and default message templates. The run method operates as follows:
- It builds the initial TaskState with tool descriptions formatted into the task description and the sample's task as the first user message.
- On each turn it calls the solver, then _parse_solver_result extracts tool calls and/or a final answer from the output.
- Tool calls are identified by the regex pattern (@ToolName: input) (excluding reserved names "Answer" and "Bugged") via _find_tool_messages.
- A final answer is extracted via _parse_final_answer matching (@Answer: output).
- If neither tool calls nor a final answer are found, a reminder message is appended and the loop continues.
- If a final answer is present, _finish_run compares it (case-insensitive, stripped) against the sample's expected answer and returns the result.
- Otherwise, each tool call is executed via _run_tool_call, which creates a ToolTaskState for the tool, invokes it, and captures the output. Tool outputs are formatted and appended as a user message.
- The loop repeats until a final answer is given or max_turns is reached.
Usage
Import Runner when building a tool-use evaluation that requires a multi-turn conversation between a solver and a set of tools. This is used by eval suites like bugged_tools where the solver must call external tools to complete a task and eventually emit a final answer.
Code Reference
Source Location
- Repository: Openai_Evals
- File: evals/elsuite/solver_tools_convo.py
- Lines: 1-240
Signature
@dataclass
class ToolCall:
tool_name: str
input: str
output: Any
@dataclass
class ParsedSolverResult:
tool_calls: list[ToolCall]
final_answer: Optional[str]
@dataclass
class RunnerResult:
final_task_state: ToolTaskState
final_solver_result: SolverResult
metrics: dict
class Runner:
def __init__(
self,
solver: Solver,
sample: Any,
name_to_tool: dict,
max_turns: int,
default_task_description: str,
default_reminder_message: str,
):
...
def run(self) -> RunnerResult:
...
def _parse_solver_result(self, solver_result: SolverResult) -> ParsedSolverResult:
...
def _parse_tool_calls(self, output: str) -> Optional[list[ToolCall]]:
...
def _find_tool_messages(self, text: str) -> list[tuple[str, str]]:
...
def _parse_final_answer(self, output: str) -> Optional[str]:
...
def _run_tool_call(self, tool_call: ToolCall) -> ToolCall:
...
def _finish_run(
self,
final_task_state: TaskState,
solver_result: SolverResult,
final_answer: Optional[str],
turn: int,
) -> RunnerResult:
...
Import
from evals.elsuite.solver_tools_convo import Runner
I/O Contract
Inputs
Runner.__init__
| Name | Type | Required | Description |
|---|---|---|---|
| solver | Solver | Yes | The solver instance that generates responses each turn |
| sample | Any (dict) | Yes | A sample dict containing "task" (the problem statement) and "answer" (the expected answer) |
| name_to_tool | dict[str, Tool] | Yes | Mapping from tool name strings to Tool instances available to the solver |
| max_turns | int | Yes | Maximum number of conversation turns before the run is terminated |
| default_task_description | str | Yes | Template for the system-level task description; must contain {tool_names_and_descriptions} placeholder |
| default_reminder_message | str | Yes | Message sent to the solver when it produces neither tool calls nor a final answer |
Runner.run
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | -- | -- | Uses instance attributes set during __init__; no additional arguments |
Outputs
RunnerResult (returned by run)
| Name | Type | Description |
|---|---|---|
| final_task_state | ToolTaskState | The complete conversation state at the end of the run, including all messages exchanged |
| final_solver_result | SolverResult | The last output produced by the solver before the run terminated |
| metrics | dict | Dictionary containing is_correct (bool: whether the final answer matched the expected answer) and num_turns (int: total number of turns used) |
ParsedSolverResult (internal)
| Name | Type | Description |
|---|---|---|
| tool_calls | list[ToolCall] | List of parsed tool calls found in the solver output (may be empty) |
| final_answer | Optional[str] | The final answer extracted from an (@Answer: ...) pattern, or None |
Usage Examples
from evals.elsuite.solver_tools_convo import Runner, RunnerResult
from evals.solvers.solver import Solver
from evals.elsuite.bugged_tools.tools import Tool
# Assuming `my_solver` is a configured Solver instance
# and `calculator_tool` is a Tool instance:
sample = {
"task": "What is 15 * 23? Use the calculator tool to find out.",
"answer": "345",
}
runner = Runner(
solver=my_solver,
sample=sample,
name_to_tool={"Calculator": calculator_tool},
max_turns=5,
default_task_description=(
"You have access to the following tools:\n{tool_names_and_descriptions}\n"
"To use a tool, write (@ToolName: input).\n"
"To give your final answer, write (@Answer: your_answer)."
),
default_reminder_message="Please use a tool or provide your final answer.",
)
result: RunnerResult = runner.run()
print(f"Correct: {result.metrics['is_correct']}")
print(f"Turns used: {result.metrics['num_turns']}")