Implementation:Openai Evals SolverToolsConvo Runner

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Tool Use
Last Updated: 2026-02-14 10:00 GMT

Overview

Concrete runner for orchestrating multi-turn solver-tool conversation loops, provided by the evals library.

Description

This module implements a multi-turn conversation loop for evaluating solvers that interact with tools. It defines three dataclasses and the Runner orchestrator class.

ToolCall holds a single tool invocation's tool_name, input string, and output (populated after execution). ParsedSolverResult contains a list of tool_calls and an optional final_answer extracted from the solver's response. RunnerResult bundles the final_task_state, final_solver_result, and computed metrics (correctness and turn count).

The Runner class drives the evaluation loop. On initialization it receives a solver, a sample dict (with "task" and "answer" keys), a name_to_tool mapping, a max_turns limit, and two default message templates (default_task_description and default_reminder_message). The run method operates as follows:

  1. It builds the initial TaskState with tool descriptions formatted into the task description and the sample's task as the first user message.
  2. On each turn it calls the solver, then _parse_solver_result extracts tool calls and/or a final answer from the output.
  3. Tool calls are identified by the regex pattern (@ToolName: input) (excluding reserved names "Answer" and "Bugged") via _find_tool_messages.
  4. A final answer is extracted via _parse_final_answer matching (@Answer: output).
  5. If the output contains neither a tool call nor a final answer, a reminder message is appended and the loop continues.
  6. If a final answer is present, _finish_run compares it (case-insensitive, stripped) against the sample's expected answer and returns the result.
  7. Otherwise, each tool call is executed via _run_tool_call, which creates a ToolTaskState for the tool, invokes it, and captures the output. Tool outputs are formatted and appended as a user message.
  8. The loop repeats until a final answer is given or max_turns is reached.
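
The steps above can be sketched as a small, self-contained loop. This is an illustrative stand-in, not the evals implementation: run_loop, the scripted solver, and the calculator function are hypothetical, messages are plain strings rather than TaskState objects, and both regexes are assumptions based only on the (@ToolName: input) and (@Answer: output) shapes described here. The turn-count convention (one turn per solver call) and the handling of unknown tool names are also assumptions.

```python
import re
from typing import Callable

# Assumed patterns: the page documents the message shapes, not the exact regexes.
TOOL_RE = re.compile(r"\(@(?!Answer\b|Bugged\b)(\w+): (.+?)\)")
ANSWER_RE = re.compile(r"\(@Answer: (.+?)\)")


def run_loop(
    solver: Callable[[list[str]], str],
    tools: dict[str, Callable[[str], str]],
    sample: dict,
    max_turns: int,
    reminder: str,
) -> dict:
    """Hypothetical stand-in for Runner.run that returns only the metrics."""
    messages = [sample["task"]]
    for turn in range(1, max_turns + 1):
        output = solver(messages)
        messages.append(output)

        # A final answer ends the run: compare case-insensitively, stripped.
        answer = ANSWER_RE.search(output)
        if answer is not None:
            given = answer.group(1).strip().lower()
            expected = sample["answer"].strip().lower()
            return {"is_correct": given == expected, "num_turns": turn}

        # Neither a tool call nor an answer: remind the solver and continue.
        calls = TOOL_RE.findall(output)
        if not calls:
            messages.append(reminder)
            continue

        # Run every requested tool and feed the outputs back as one message.
        outputs = []
        for name, arg in calls:
            tool = tools.get(name)
            result_text = tool(arg) if tool else "unknown tool"
            outputs.append(f"{name} output: {result_text}")
        messages.append("\n".join(outputs))

    # Turn limit reached without a final answer.
    return {"is_correct": False, "num_turns": max_turns}


def calculator(expression: str) -> str:
    # Toy tool that only understands "a * b".
    left, right = expression.split("*")
    return str(int(left) * int(right))


# Scripted solver: stalls once, calls the tool, then answers.
replies = iter(["Let me think.", "(@Calculator: 15 * 23)", "(@Answer: 345)"])
result = run_loop(
    solver=lambda messages: next(replies),
    tools={"Calculator": calculator},
    sample={"task": "What is 15 * 23?", "answer": "345"},
    max_turns=5,
    reminder="Please use a tool or provide your final answer.",
)
print(result)
```

The scripted solver exercises all three branches in order: a reminder turn, a tool-call turn, and a final-answer turn.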

Usage

Import Runner when building a tool-use evaluation that requires a multi-turn conversation between a solver and a set of tools. Eval suites such as bugged_tools use it: the solver must call external tools to complete a task and eventually emit a final answer.

Code Reference

Source Location

Signature

@dataclass
class ToolCall:
    tool_name: str
    input: str
    output: Any

@dataclass
class ParsedSolverResult:
    tool_calls: list[ToolCall]
    final_answer: Optional[str]

@dataclass
class RunnerResult:
    final_task_state: ToolTaskState
    final_solver_result: SolverResult
    metrics: dict

class Runner:
    def __init__(
        self,
        solver: Solver,
        sample: Any,
        name_to_tool: dict,
        max_turns: int,
        default_task_description: str,
        default_reminder_message: str,
    ):
        ...

    def run(self) -> RunnerResult:
        ...

    def _parse_solver_result(self, solver_result: SolverResult) -> ParsedSolverResult:
        ...

    def _parse_tool_calls(self, output: str) -> Optional[list[ToolCall]]:
        ...

    def _find_tool_messages(self, text: str) -> list[tuple[str, str]]:
        ...

    def _parse_final_answer(self, output: str) -> Optional[str]:
        ...

    def _run_tool_call(self, tool_call: ToolCall) -> ToolCall:
        ...

    def _finish_run(
        self,
        final_task_state: TaskState,
        solver_result: SolverResult,
        final_answer: Optional[str],
        turn: int,
    ) -> RunnerResult:
        ...

Import

from evals.elsuite.solver_tools_convo import Runner

I/O Contract

Inputs

Runner.__init__

Name | Type | Required | Description
solver | Solver | Yes | The solver instance that generates responses each turn
sample | Any (dict) | Yes | A sample dict containing "task" (the problem statement) and "answer" (the expected answer)
name_to_tool | dict[str, Tool] | Yes | Mapping from tool name strings to Tool instances available to the solver
max_turns | int | Yes | Maximum number of conversation turns before the run is terminated
default_task_description | str | Yes | Template for the system-level task description; must contain the {tool_names_and_descriptions} placeholder
default_reminder_message | str | Yes | Message sent to the solver when it produces neither tool calls nor a final answer
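
The placeholder requirement on default_task_description can be illustrated with plain str.format. The sketch below is only an assumption about how the template is filled: the tool descriptions and the one-line-per-tool layout are hypothetical, and the real Tool objects supply their own metadata.

```python
# Hypothetical tool descriptions, used only for illustration.
tool_descriptions = {
    "Calculator": "Evaluates a simple arithmetic expression.",
    "Reverse": "Reverses the input string.",
}

template = (
    "You have access to the following tools:\n{tool_names_and_descriptions}\n"
    "To use a tool, write (@ToolName: input)."
)

# One "name: description" line per tool (this layout is an assumption).
listing = "\n".join(f"{name}: {desc}" for name, desc in tool_descriptions.items())
task_description = template.format(tool_names_and_descriptions=listing)
print(task_description)

# str.format ignores unused keyword arguments, so a template without the
# placeholder formats without error and silently omits the tool listing.
broken = "You have access to some tools.".format(tool_names_and_descriptions=listing)
```

Because a missing placeholder does not raise, checking for "{tool_names_and_descriptions}" in the template before constructing the runner is a cheap safeguard.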

Runner.run

Name | Type | Required | Description
(none) | -- | -- | Uses instance attributes set during __init__; no additional arguments

Outputs

RunnerResult (returned by run)

Name | Type | Description
final_task_state | ToolTaskState | The complete conversation state at the end of the run, including all messages exchanged
final_solver_result | SolverResult | The last output produced by the solver before the run terminated
metrics | dict | Dictionary containing is_correct (bool: whether the final answer matched the expected answer) and num_turns (int: total number of turns used)
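
The is_correct metric rests on a case-insensitive, whitespace-stripped comparison. A minimal sketch of that check follows. The helper name answers_match is hypothetical, and two details are assumptions rather than documented behavior: normalizing the expected answer as well, and treating a missing final answer as incorrect.

```python
from typing import Optional


def answers_match(final_answer: Optional[str], expected: str) -> bool:
    """Hypothetical helper mirroring the comparison described for _finish_run."""
    if final_answer is None:
        # Assumption: a run that ends without an answer counts as incorrect.
        return False
    return final_answer.strip().lower() == expected.strip().lower()


print(answers_match("  Paris ", "paris"))
print(answers_match("345.0", "345"))
print(answers_match(None, "345"))
```

The comparison is purely textual, so numerically equal answers written differently (such as "345.0" and "345") do not match under this sketch.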

ParsedSolverResult (internal)

Name | Type | Description
tool_calls | list[ToolCall] | List of parsed tool calls found in the solver output (may be empty)
final_answer | Optional[str] | The final answer extracted from an (@Answer: ...) pattern, or None
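
To make the parsing contract concrete, the sketch below fills the two dataclasses from a raw solver output. The dataclass fields match the signatures on this page; parse_output and both regexes are hypothetical, written to fit the documented (@ToolName: input) and (@Answer: ...) shapes and the exclusion of the reserved names "Answer" and "Bugged".

```python
import re
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCall:
    tool_name: str
    input: str
    output: Any  # stays None until the tool has been executed


@dataclass
class ParsedSolverResult:
    tool_calls: list[ToolCall]
    final_answer: Optional[str]


# Assumed patterns; the negative lookahead skips the reserved names.
TOOL_RE = re.compile(r"\(@(?!Answer\b|Bugged\b)(\w+): (.+?)\)")
ANSWER_RE = re.compile(r"\(@Answer: (.+?)\)")


def parse_output(output: str) -> ParsedSolverResult:
    """Hypothetical parser: collect every tool call plus an optional answer."""
    calls = [ToolCall(name, arg, None) for name, arg in TOOL_RE.findall(output)]
    match = ANSWER_RE.search(output)
    return ParsedSolverResult(calls, match.group(1) if match else None)


parsed = parse_output("First (@Reverse: abc) then (@Calculator: 2 + 2).")
for call in parsed.tool_calls:
    print(call.tool_name, repr(call.input))
print(parsed.final_answer)
```

One limitation of these assumed patterns: the non-greedy match stops at the first closing parenthesis, so a tool input that itself contains ")" would be truncated.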

Usage Examples

from evals.elsuite.solver_tools_convo import Runner, RunnerResult
from evals.solvers.solver import Solver
from evals.elsuite.bugged_tools.tools import Tool

# Assuming `my_solver` is a configured Solver instance
# and `calculator_tool` is a Tool instance:
sample = {
    "task": "What is 15 * 23? Use the calculator tool to find out.",
    "answer": "345",
}

runner = Runner(
    solver=my_solver,
    sample=sample,
    name_to_tool={"Calculator": calculator_tool},
    max_turns=5,
    default_task_description=(
        "You have access to the following tools:\n{tool_names_and_descriptions}\n"
        "To use a tool, write (@ToolName: input).\n"
        "To give your final answer, write (@Answer: your_answer)."
    ),
    default_reminder_message="Please use a tool or provide your final answer.",
)

result: RunnerResult = runner.run()
print(f"Correct: {result.metrics['is_correct']}")
print(f"Turns used: {result.metrics['num_turns']}")
