Implementation:Microsoft Autogen Agbench Tabulate Cmd

From Leeroopedia
Property | Value
Source | https://github.com/microsoft/autogen
Domains | Benchmarking, CLI, Data_Analysis, Statistics, Reporting
Last Updated | 2026-02-11 17:00 GMT

Overview

A command-line tool that analyzes benchmark run logs, computes success rates and timing statistics across multiple trials, and presents the results as formatted tables, CSV, or Excel.

Description

The tabulate_cmd.py module analyzes and reports AutoGenBench results.

default_scorer() scores an instance by searching its console log for markers such as "ALL TESTS PASSED !#!#" and "SCENARIO.PY COMPLETE !#!#". It returns True for a success, False for a run that completed but failed, and None for an incomplete run. default_timer() extracts the runtime by matching a regular expression against "RUNTIME: <value> !#!#" markers.

default_tabulate() runs the analysis pipeline: it iterates through the task directories, collects the success or failure of each trial instance, computes aggregate statistics (success rates and timing metrics), and formats the output with the pandas and tabulate libraries. Output is a human-readable table by default, CSV with the --csv flag, or an Excel spreadsheet with the --excel flag. The statistics cover per-trial summaries (successes, failures, missing runs, success rates, average times), trial-aggregated metrics (the rate of tasks with at least one success and the rate of tasks where every trial succeeded), and per-task breakdowns.

tabulate_cli() is the entry point. It implements a plugin system that searches the directory hierarchy for a custom_tabulate.py file, which lets a benchmark supply its own tabulation logic, and falls back to the default implementation when none is found. The module also handles missing data, sorts by modification time, and excludes system directories such as __pycache__.
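The three-way scoring result described above can be sketched in a few lines. This is a minimal illustration, not the module's actual code: sketch_scorer, score_log_text, and the COMPLETED_STRINGS constant are hypothetical names, and the precedence of the checks (success marker first, then completion marker) is an assumption based on the description.

```python
import os
import tempfile
from typing import List, Optional

# Marker strings quoted in the description; the constant names are assumptions.
SUCCESS_STRINGS = ["ALL TESTS PASSED !#!#"]
COMPLETED_STRINGS = ["SCENARIO.PY COMPLETE !#!#"]


def sketch_scorer(instance_dir: str, success_strings: List[str] = SUCCESS_STRINGS) -> Optional[bool]:
    """Hypothetical stand-in for default_scorer(); returns True, False, or None."""
    console_log = os.path.join(instance_dir, "console_log.txt")
    if not os.path.isfile(console_log):
        return None  # no log at all: the run is missing
    with open(console_log, "rt", encoding="utf-8", errors="replace") as fh:
        content = fh.read()
    if any(marker in content for marker in success_strings):
        return True  # a success marker was printed
    if any(marker in content for marker in COMPLETED_STRINGS):
        return False  # the scenario finished without a success marker
    return None  # the log stops before the completion marker


def score_log_text(text: Optional[str]) -> Optional[bool]:
    """Write text to a throwaway instance directory and score it (None = no log file)."""
    with tempfile.TemporaryDirectory() as instance_dir:
        if text is not None:
            log_path = os.path.join(instance_dir, "console_log.txt")
            with open(log_path, "wt", encoding="utf-8") as fh:
                fh.write(text)
        return sketch_scorer(instance_dir)


print(score_log_text("ALL TESTS PASSED !#!#\nSCENARIO.PY COMPLETE !#!#\n"))  # True
print(score_log_text("SCENARIO.PY COMPLETE !#!#\n"))  # False
print(score_log_text("still running"))  # None
```

Keeping None distinct from False is what lets the summary tables report missing runs separately from failures.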

Usage

Use this module when you need to:

  • Analyze and summarize benchmark run results across multiple trials
  • Generate statistical reports on benchmark success rates and performance
  • Export benchmark results to CSV or Excel for further analysis
  • Compute aggregate metrics like average success rates and total runtime
  • Identify missing or incomplete benchmark runs
  • Create publication-ready tables of benchmark performance
  • Implement custom tabulation logic for specific benchmark types
  • Compare performance across different tasks or configurations

Code Reference

Source Location: python/packages/agbench/src/agbench/tabulate_cmd.py

Signature:

def tabulate_cli(args: Sequence[str]) -> None:
    """
    CLI entry point for benchmark result tabulation.

    Searches for custom tabulation modules or uses default implementation.
    Delegates to appropriate tabulation function with parsed arguments.

    Args:
        args: Command-line arguments, where args[0] is the invocation command
              and args[1:] holds the remaining, not-yet-parsed arguments

    Returns:
        None (outputs formatted results to stdout/stderr)
    """

def default_tabulate(
    args: List[str],
    scorer: ScorerFunc = default_scorer,
    timer: TimerFunc = default_timer,
    exclude_dir_names: List[str] = EXCLUDE_DIR_NAMES,
) -> None:
    """
    Default tabulation implementation for benchmark results.

    Scans runlogs directory, evaluates each instance, computes statistics,
    and formats output as tables, CSV, or Excel.

    Args:
        args: Command-line arguments list
        scorer: Function to evaluate instance success (returns bool or None)
        timer: Function to extract runtime from instance logs (returns float or None)
        exclude_dir_names: List of directory names to skip during traversal

    Returns:
        None (prints formatted results and statistics)
    """

def default_scorer(instance_dir: str, success_strings: List[str] = SUCCESS_STRINGS) -> Optional[bool]:
    """
    Evaluate if a benchmark instance succeeded, failed, or is incomplete.

    Args:
        instance_dir: Path to instance directory containing console_log.txt
        success_strings: List of strings indicating successful completion

    Returns:
        True if succeeded, False if completed but failed, None if incomplete
    """

def default_timer(instance_dir: str, timer_regex: str = TIMER_REGEX) -> Optional[float]:
    """
    Extract runtime value from instance console log.

    Args:
        instance_dir: Path to instance directory containing console_log.txt
        timer_regex: Regex pattern to match runtime information

    Returns:
        Float runtime value in seconds, or None if not found
    """

Import:

from agbench.tabulate_cmd import tabulate_cli, default_tabulate, default_scorer, default_timer
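The runtime extraction that default_timer() performs can be illustrated on a plain string. The sketch below is an assumption-laden stand-in: sketch_timer_from_text is a hypothetical helper, and SKETCH_TIMER_REGEX is only a guess at a pattern that matches the "RUNTIME: <value> !#!#" marker, since the real TIMER_REGEX value is not shown on this page.

```python
import re
from typing import Optional

# Assumed pattern for the "RUNTIME: <value> !#!#" marker; the real TIMER_REGEX may differ.
SKETCH_TIMER_REGEX = r"RUNTIME:\s*([\d.]+) !#!#"


def sketch_timer_from_text(content: str, timer_regex: str = SKETCH_TIMER_REGEX) -> Optional[float]:
    """Return the first captured runtime as a float, or None if nothing usable matches."""
    match = re.search(timer_regex, content)
    if match is None:
        return None  # no runtime marker in the log
    try:
        return float(match.group(1))
    except ValueError:
        return None  # captured text such as "1.2.3" is not a valid number


log_text = "step 1 done\nRUNTIME: 45.2 !#!#\nSCENARIO.PY COMPLETE !#!#\n"
print(sketch_timer_from_text(log_text))  # 45.2
print(sketch_timer_from_text("no marker here"))  # None
```

Any replacement pattern passed as timer_regex needs one capture group around the numeric value, because the sketch reads group 1.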

I/O Contract

Inputs

Parameter | Type | Required | Description
args | Sequence[str] | Yes | Command-line arguments. args[0] is the invocation command; args[1] is the runlogs path
runlogs | str | Yes | Path to the directory containing benchmark results, organized by task_id/instance
--csv | flag | No | Output results in CSV format instead of formatted tables
--excel | str | No | Path for the Excel output file (e.g., results.xlsx)
scorer | ScorerFunc | No | Custom function to evaluate instance success (default: default_scorer)
timer | TimerFunc | No | Custom function to extract runtime (default: default_timer)
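How the args list maps onto these inputs can be sketched with argparse from the standard library. This is an illustration under assumptions, not the module's parser: parse_tabulate_args is a hypothetical helper, the help texts are invented, and only the runlogs positional, --csv with its -c short form, and --excel are taken from this page.

```python
import argparse
from typing import Sequence


def parse_tabulate_args(args: Sequence[str]) -> argparse.Namespace:
    """Hypothetical parser: args[0] is the invocation command, args[1:] the raw arguments."""
    parser = argparse.ArgumentParser(prog=args[0], description="Tabulate benchmark run logs.")
    parser.add_argument("runlogs", help="Directory of results organized by task_id/instance.")
    parser.add_argument("-c", "--csv", action="store_true", help="Emit CSV instead of a formatted table.")
    parser.add_argument("--excel", type=str, default=None, help="Path of an Excel file to write.")
    return parser.parse_args(args[1:])


parsed = parse_tabulate_args(["autogenbench tabulate", "path/to/runlogs", "--excel", "results.xlsx"])
print(parsed.runlogs, parsed.csv, parsed.excel)  # path/to/runlogs False results.xlsx
```

Passing args[0] as prog keeps usage and error messages consistent with the command the user actually typed.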

Outputs

Output | Type | Description
Results table | str | Formatted table with per-task, per-trial success/failure data
Summary statistics | str | Per-trial aggregate metrics (successes, failures, rates, times)
Trial aggregated stats | str | Cross-trial success rates (at least one success, all successes)
CSV output | str | Comma-separated values output when the --csv flag is used
Excel file | file | Excel spreadsheet when an --excel path is specified
Warning message | str | Alpha-version caution about citing values in academic work
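The per-trial and cross-trial statistics listed above can be worked through by hand on a small data set. The sketch below uses only the standard library and is not the module's pandas-based implementation: per_trial_summary and cross_trial_rates are hypothetical helpers, and counting missing runs in the denominator of the success rate is an assumption.

```python
from typing import Dict, List, Optional

# One list per task, one entry per trial: True, False, or None (missing).
results: Dict[str, List[Optional[bool]]] = {
    "task_abc123": [True, True],
    "task_def456": [False, True],
    "task_ghi789": [True, True],
}


def per_trial_summary(results: Dict[str, List[Optional[bool]]]) -> List[dict]:
    """Count successes, failures, and missing runs for each trial index."""
    n_trials = max((len(trials) for trials in results.values()), default=0)
    summary = []
    for trial in range(n_trials):
        # A task with fewer trials than others is treated as missing for that trial.
        outcomes = [t[trial] if trial < len(t) else None for t in results.values()]
        successes = sum(1 for o in outcomes if o is True)
        total = len(outcomes)
        summary.append({
            "successes": successes,
            "failures": sum(1 for o in outcomes if o is False),
            "missing": sum(1 for o in outcomes if o is None),
            "total": total,
            # Assumption: missing runs count against the success rate.
            "success_rate": successes / total if total else 0.0,
        })
    return summary


def cross_trial_rates(results: Dict[str, List[Optional[bool]]]) -> Dict[str, float]:
    """Fraction of tasks with at least one success, and with every trial successful."""
    total = len(results)
    if total == 0:
        return {"at_least_one_success": 0.0, "all_successes": 0.0}
    at_least_one = sum(1 for t in results.values() if any(o is True for o in t))
    all_success = sum(1 for t in results.values() if t and all(o is True for o in t))
    return {"at_least_one_success": at_least_one / total, "all_successes": all_success / total}


for index, row in enumerate(per_trial_summary(results)):
    print(f"Trial {index}: {row}")
print(cross_trial_rates(results))
```

With this data every task succeeds at least once, while only two of the three tasks succeed in both trials.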

Usage Examples

Command-Line Usage:

# Generate formatted table output (default)
autogenbench tabulate path/to/runlogs

# Example output:
#   Task Id         Trial 0 Success  Trial 0 Time  Trial 1 Success  Trial 1 Time
# --  ------------  -----------------  --------------  -----------------  --------------
#  0  task_abc123   True               45.2           True               43.8
#  1  task_def456   False              12.1           True               38.5
#  2  task_ghi789   True               67.3           True               65.9
#
# Summary Statistics
#            Successes  Failures  Missing  Total  Average Success Rate  Average Time  Total Time
# ---------  -----------  ----------  ---------  -------  ----------------------  --------------  ------------
# Trial 0    2          1         0        3      0.667                   41.5          124.6
# Trial 1    3          0         0        3      1.000                   49.4          148.2

# Generate CSV output
autogenbench tabulate path/to/runlogs --csv

# Generate Excel output
autogenbench tabulate path/to/runlogs --excel results.xlsx

# Use short form of CSV flag
autogenbench tabulate path/to/runlogs -c

Programmatic Usage:

from typing import Optional

from agbench.tabulate_cmd import default_tabulate, default_scorer, default_timer

# Tabulate with default settings
default_tabulate(
    args=["autogenbench tabulate", "/path/to/runlogs"]
)

# Tabulate with CSV output
default_tabulate(
    args=["autogenbench tabulate", "/path/to/runlogs", "--csv"]
)

# Use custom scorer and timer functions
def custom_scorer(instance_dir: str) -> Optional[bool]:
    """Custom logic to evaluate success"""
    # Your custom implementation
    pass

def custom_timer(instance_dir: str) -> Optional[float]:
    """Custom logic to extract runtime"""
    # Your custom implementation
    pass

default_tabulate(
    args=["autogenbench tabulate", "/path/to/runlogs"],
    scorer=custom_scorer,
    timer=custom_timer
)

Custom Tabulation Module:

# Create custom_tabulate.py in your benchmark directory

from typing import List, Optional
from agbench.tabulate_cmd import default_scorer, default_timer

def main(args: List[str]) -> None:
    """
    Custom tabulation logic for specific benchmark.

    This function will be called instead of default_tabulate
    when custom_tabulate.py is found in the directory hierarchy.
    """
    print("Using custom tabulation logic")

    # Define custom success strings for this benchmark
    custom_success_strings = [
        "CUSTOM_COMPLETE",
        "BENCHMARK_PASSED"
    ]

    # Define custom timer regex
    custom_timer_regex = r"TIME_ELAPSED:\s*([\d.]+)"

    # Use default implementation with custom parameters
    from agbench.tabulate_cmd import default_tabulate

    def custom_scorer(instance_dir: str) -> Optional[bool]:
        return default_scorer(instance_dir, custom_success_strings)

    def custom_timer(instance_dir: str) -> Optional[float]:
        return default_timer(instance_dir, custom_timer_regex)

    default_tabulate(args, scorer=custom_scorer, timer=custom_timer)
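The discovery step that makes such a module take effect can be sketched as an upward directory search followed by a dynamic import. This is a hedged illustration, not the lookup that tabulate_cli() actually performs: find_custom_tabulate and load_main are hypothetical helpers, and the assumption that the file sits directly in one of the parent directories (rather than in a subfolder) is mine.

```python
import importlib.util
import os
import tempfile
from typing import Callable, Optional


def find_custom_tabulate(start_dir: str, filename: str = "custom_tabulate.py") -> Optional[str]:
    """Walk from start_dir toward the filesystem root and return the first match."""
    current = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(current, filename)
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:  # reached the root without finding the file
            return None
        current = parent


def load_main(path: str) -> Callable:
    """Import the file at path as a module and return its main() function."""
    spec = importlib.util.spec_from_file_location("custom_tabulate", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.main


# Demonstration in a throwaway directory tree: bench/custom_tabulate.py and bench/runlogs/task_1/0
with tempfile.TemporaryDirectory() as root:
    runlogs = os.path.join(root, "bench", "runlogs", "task_1", "0")
    os.makedirs(runlogs)
    with open(os.path.join(root, "bench", "custom_tabulate.py"), "w", encoding="utf-8") as fh:
        fh.write("def main(args):\n    return 'custom:' + args[1]\n")
    found = find_custom_tabulate(runlogs)
    print(os.path.basename(found))  # custom_tabulate.py
    print(load_main(found)(["autogenbench tabulate", "runs"]))  # custom:runs
```

When the search returns None, a caller would fall back to the default tabulation, which matches the fallback behavior described for the entry point.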

Direct Scorer and Timer Usage:

from agbench.tabulate_cmd import default_scorer, default_timer

# Check if an instance succeeded
instance_path = "/path/to/runlogs/task_123/0"
success = default_scorer(instance_path)

if success is True:
    print("Instance succeeded")
elif success is False:
    print("Instance completed but failed")
else:  # None
    print("Instance incomplete or not started")

# Get runtime for an instance
runtime = default_timer(instance_path)
if runtime is not None:
    print(f"Runtime: {runtime:.2f} seconds")
else:
    print("Runtime not available")

Advanced Analysis with Pandas:

import pandas as pd
from agbench.tabulate_cmd import default_scorer, default_timer
import os

# Build custom dataframe from runlogs
runlogs_path = "/path/to/runlogs"
results = []

for task_id in os.listdir(runlogs_path):
    task_path = os.path.join(runlogs_path, task_id)
    if not os.path.isdir(task_path):
        continue

    for instance in range(10):  # Check first 10 instances
        instance_dir = os.path.join(task_path, str(instance))
        if not os.path.isdir(instance_dir):
            break

        results.append({
            'task_id': task_id,
            'instance': instance,
            'success': default_scorer(instance_dir),
            'runtime': default_timer(instance_dir)
        })

df = pd.DataFrame(results)

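# Custom analysis: score only completed runs (success is True or False, not None)
completed = df['success'].dropna().astype(bool)
success_rate = completed.mean()
avg_runtime = pd.to_numeric(df['runtime']).mean(skipna=True)
print(f"Overall success rate (completed runs): {success_rate:.2%}")
print(f"Average runtime: {avg_runtime:.2f}s")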
