Implementation:Microsoft Autogen Agbench Tabulate Cmd
| Property | Value |
|---|---|
| Source | https://github.com/microsoft/autogen |
| Domains | Benchmarking CLI Data_Analysis Statistics Reporting |
| Last Updated | 2026-02-11 17:00 GMT |
Overview
A command-line tool that analyzes benchmark run logs, computes success rates and timing statistics across multiple trials, and presents the results as formatted tables, CSV, or Excel files.
Description
The tabulate_cmd.py module analyzes and reports AutoGenBench results.

Scoring is handled by default_scorer(), which searches each instance's console log for completion markers. It returns True when a success marker such as "ALL TESTS PASSED !#!#" is present, False when the run completed without succeeding (the log contains "SCENARIO.PY COMPLETE !#!#" but no success marker), and None when the run is incomplete. The default_timer() function extracts the runtime by matching a regex against "RUNTIME: <value> !#!#" markers.

The core default_tabulate() function runs the analysis pipeline: it iterates through task directories, collects success/failure data for each trial instance, computes aggregate statistics (success rates and timing metrics), and formats the output with the pandas and tabulate libraries. Output is a human-readable table by default, CSV with the --csv flag, or an Excel spreadsheet with the --excel flag.

The reported statistics include per-trial summaries (successes, failures, missing, success rates, average times), trial-aggregated metrics (the rate of tasks with at least one success and the rate of tasks where every trial succeeded), and per-task breakdowns.

The tabulate_cli() entry point implements a plugin mechanism: it searches the directory hierarchy for a custom_tabulate.py file, which lets a benchmark supply its own tabulation logic, and falls back to the default implementation when none is found. The module also handles missing data, sorts by modification time, and excludes system directories such as __pycache__.
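The three-way scoring outcome and the regex-based timer described above can be illustrated with a standalone sketch. This is not the module's source: the names SUCCESS_MARKERS, COMPLETED_MARKERS, RUNTIME_PATTERN, sketch_scorer, and sketch_timer are hypothetical, and the exact regex is an assumption based on the "RUNTIME: <value> !#!#" marker format.

```python
import os
import re
from typing import Optional

# Hypothetical constants for illustration; the real module defines its own.
SUCCESS_MARKERS = ["ALL TESTS PASSED !#!#"]
COMPLETED_MARKERS = ["SCENARIO.PY COMPLETE !#!#"]
RUNTIME_PATTERN = r"RUNTIME:\s*([\d.]+) !#!#"  # assumed regex


def read_console_log(instance_dir: str) -> Optional[str]:
    """Return the text of console_log.txt, or None if the file is absent."""
    path = os.path.join(instance_dir, "console_log.txt")
    if not os.path.isfile(path):
        return None
    with open(path, "rt", encoding="utf-8") as fh:
        return fh.read()


def sketch_scorer(instance_dir: str) -> Optional[bool]:
    """True = succeeded, False = completed but failed, None = incomplete."""
    content = read_console_log(instance_dir)
    if content is None:
        return None
    if any(marker in content for marker in SUCCESS_MARKERS):
        return True
    if any(marker in content for marker in COMPLETED_MARKERS):
        return False
    return None


def sketch_timer(instance_dir: str) -> Optional[float]:
    """Return the runtime captured by the regex, or None if no marker is found."""
    content = read_console_log(instance_dir)
    if content is None:
        return None
    match = re.search(RUNTIME_PATTERN, content)
    return float(match.group(1)) if match else None
```

Checking the success markers before the completion markers matters: a successful log may contain both, and the success marker should win.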
Usage
Use this module when you need to:
- Analyze and summarize benchmark run results across multiple trials
- Generate statistical reports on benchmark success rates and performance
- Export benchmark results to CSV or Excel for further analysis
- Compute aggregate metrics like average success rates and total runtime
- Identify missing or incomplete benchmark runs
- Create publication-ready tables of benchmark performance
- Implement custom tabulation logic for specific benchmark types
- Compare performance across different tasks or configurations
Code Reference
Source Location: python/packages/agbench/src/agbench/tabulate_cmd.py
Signature:
def tabulate_cli(args: Sequence[str]) -> None:
    """
    CLI entry point for benchmark result tabulation.

    Searches for a custom tabulation module or uses the default implementation,
    then delegates to the chosen tabulation function.

    Args:
        args: Command-line arguments, where args[0] is the invocation command
            and args[1:] holds the remaining command-line arguments

    Returns:
        None (outputs formatted results to stdout/stderr)
    """

def default_tabulate(
    args: List[str],
    scorer: ScorerFunc = default_scorer,
    timer: TimerFunc = default_timer,
    exclude_dir_names: List[str] = EXCLUDE_DIR_NAMES,
) -> None:
    """
    Default tabulation implementation for benchmark results.

    Scans runlogs directory, evaluates each instance, computes statistics,
    and formats output as tables, CSV, or Excel.

    Args:
        args: Command-line arguments list
        scorer: Function to evaluate instance success (returns bool or None)
        timer: Function to extract runtime from instance logs (returns float or None)
        exclude_dir_names: List of directory names to skip during traversal

    Returns:
        None (prints formatted results and statistics)
    """

def default_scorer(instance_dir: str, success_strings: List[str] = SUCCESS_STRINGS) -> Optional[bool]:
    """
    Evaluate if a benchmark instance succeeded, failed, or is incomplete.

    Args:
        instance_dir: Path to instance directory containing console_log.txt
        success_strings: List of strings indicating successful completion

    Returns:
        True if succeeded, False if completed but failed, None if incomplete
    """

def default_timer(instance_dir: str, timer_regex: str = TIMER_REGEX) -> Optional[float]:
    """
    Extract runtime value from instance console log.

    Args:
        instance_dir: Path to instance directory containing console_log.txt
        timer_regex: Regex pattern to match runtime information

    Returns:
        Float runtime value in seconds, or None if not found
    """
Import:
from agbench.tabulate_cmd import tabulate_cli, default_tabulate, default_scorer, default_timer
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| args | Sequence[str] | Yes | Command-line arguments. args[0] is the invocation command; args[1:] are the CLI arguments (the runlogs path and any optional flags) |
| runlogs | str | Yes | Path to the directory containing benchmark results, organized by task_id/instance |
| --csv | flag | No | Output results in CSV format instead of formatted tables |
| --excel | str | No | Path for the Excel output file (e.g., results.xlsx) |
| scorer | ScorerFunc | No | Custom function to evaluate instance success (default: default_scorer) |
| timer | TimerFunc | No | Custom function to extract runtime (default: default_timer) |
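The CLI inputs in the table above map naturally onto an argparse parser. The following is a minimal sketch under that assumption; build_parser and parse_tabulate_args are hypothetical names, the help strings are illustrative, and only the options named on this page (runlogs, -c/--csv, --excel) are included.

```python
import argparse
from typing import List


def build_parser(prog: str) -> argparse.ArgumentParser:
    """Build a parser for the inputs listed in the table (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog=prog,
        description="Tabulate the results of a previous benchmark run.",
    )
    # Required positional argument: where the run logs live.
    parser.add_argument("runlogs", help="Path to the directory containing run logs.")
    # Boolean flag: present means CSV output, absent means formatted tables.
    parser.add_argument("-c", "--csv", action="store_true", help="Output results as CSV.")
    # Optional value: destination path for an Excel file.
    parser.add_argument("--excel", type=str, default=None, help="Path for Excel output.")
    return parser


def parse_tabulate_args(args: List[str]) -> argparse.Namespace:
    """args[0] is the invocation command; the rest is handed to argparse."""
    return build_parser(args[0]).parse_args(args[1:])


parsed = parse_tabulate_args(["agbench tabulate", "path/to/runlogs", "--csv"])
print(parsed.runlogs, parsed.csv, parsed.excel)
```

Passing args[0] as prog keeps usage and error messages consistent with however the command was invoked.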
Outputs
| Output | Type | Description |
|---|---|---|
| Results table | str | Formatted table with per-task, per-trial success/failure data |
| Summary statistics | str | Per-trial aggregate metrics (successes, failures, rates, times) |
| Trial aggregated stats | str | Cross-trial success rates (at least one, all successes) |
| CSV output | str | Comma-separated values output when --csv flag is used |
| Excel file | file | Excel spreadsheet when --excel path is specified |
| Warning message | str | Alpha-version caution about citing values in academic work |
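The two cross-trial rates in the table can be sketched in plain Python. This is an illustration rather than the module's implementation: trial_aggregates is a hypothetical helper, and treating a missing trial (None) as "not a success" is an assumption, since how the module counts missing trials in these rates is not stated here.

```python
from typing import Dict, List, Optional


def trial_aggregates(results: Dict[str, List[Optional[bool]]]) -> Dict[str, float]:
    """Compute cross-trial rates from a mapping of task id to per-trial outcomes.

    Each outcome is True (success), False (failure), or None (missing).
    """
    total = len(results)
    if total == 0:
        return {"at_least_one_success_rate": 0.0, "all_successes_rate": 0.0}

    # A task counts if any of its trials succeeded.
    at_least_one = sum(
        1 for trials in results.values() if any(t is True for t in trials)
    )
    # A task counts only if it has trials and every one of them succeeded.
    all_ok = sum(
        1 for trials in results.values() if trials and all(t is True for t in trials)
    )
    return {
        "at_least_one_success_rate": at_least_one / total,
        "all_successes_rate": all_ok / total,
    }


example = {
    "task_abc123": [True, True],
    "task_def456": [False, True],
    "task_ghi789": [True, True],
}
print(trial_aggregates(example))
```

With the three example tasks, every task has at least one successful trial, while only two of the three succeed in all trials.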
Usage Examples
Command-Line Usage:
# Generate formatted table output (default)
agbench tabulate path/to/runlogs

# Example output:
#     Task Id       Trial 0 Success      Trial 0 Time  Trial 1 Success      Trial 1 Time
# --  ------------  -----------------  --------------  -----------------  --------------
#  0  task_abc123   True                         45.2  True                         43.8
#  1  task_def456   False                        12.1  True                         38.5
#  2  task_ghi789   True                         67.3  True                         65.9
#
# Summary Statistics
#              Successes    Failures    Missing    Total    Average Success Rate    Average Time    Total Time
# ---------  -----------  ----------  ---------  -------  ----------------------  --------------  ------------
# Trial 0              2           1          0        3                   0.667            41.5         124.6
# Trial 1              3           0          0        3                   1.000            49.4         148.2

# Generate CSV output
agbench tabulate path/to/runlogs --csv

# Generate Excel output
agbench tabulate path/to/runlogs --excel results.xlsx

# Use short form of CSV flag
agbench tabulate path/to/runlogs -c
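The Summary Statistics rows in the example output can be reproduced by hand. The sketch below does so for one trial in plain Python; summarize_trial and the Outcome alias are hypothetical, and dividing successes by the total number of instances (including missing ones) is an assumption that happens to agree with the example, where nothing is missing.

```python
from typing import Dict, List, Optional, Tuple

# (success, runtime in seconds); either element may be None for a missing run.
Outcome = Tuple[Optional[bool], Optional[float]]


def summarize_trial(outcomes: List[Outcome]) -> Dict[str, float]:
    """Aggregate one trial's outcomes into the columns of the summary table."""
    successes = sum(1 for ok, _ in outcomes if ok is True)
    failures = sum(1 for ok, _ in outcomes if ok is False)
    missing = sum(1 for ok, _ in outcomes if ok is None)
    times = [t for _, t in outcomes if t is not None]
    total = len(outcomes)
    return {
        "successes": successes,
        "failures": failures,
        "missing": missing,
        "total": total,
        # Assumed denominator: all instances in the trial.
        "average_success_rate": successes / total if total else 0.0,
        "average_time": sum(times) / len(times) if times else 0.0,
        "total_time": sum(times),
    }


# Trial 0 from the example output above.
trial_0 = [(True, 45.2), (False, 12.1), (True, 67.3)]
summary = summarize_trial(trial_0)
print({key: round(value, 3) for key, value in summary.items()})
```

Runtimes are averaged only over instances that reported one, so a missing run does not drag the average toward zero.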
Programmatic Usage:
from typing import Optional

from agbench.tabulate_cmd import default_tabulate, default_scorer, default_timer

# Tabulate with default settings
# (args[0] is the invocation command; the rest are the CLI arguments)
default_tabulate(
    args=["agbench tabulate", "/path/to/runlogs"]
)

# Tabulate with CSV output
default_tabulate(
    args=["agbench tabulate", "/path/to/runlogs", "--csv"]
)

# Use custom scorer and timer functions
def custom_scorer(instance_dir: str) -> Optional[bool]:
    """Custom logic to evaluate success"""
    # Your custom implementation; returning None marks the instance as incomplete
    return None

def custom_timer(instance_dir: str) -> Optional[float]:
    """Custom logic to extract runtime"""
    # Your custom implementation; returning None means no runtime was found
    return None

default_tabulate(
    args=["agbench tabulate", "/path/to/runlogs"],
    scorer=custom_scorer,
    timer=custom_timer
)
Custom Tabulation Module:
# Create custom_tabulate.py in your benchmark directory
from typing import List, Optional

from agbench.tabulate_cmd import default_scorer, default_tabulate, default_timer

def main(args: List[str]) -> None:
    """
    Custom tabulation logic for specific benchmark.

    This function will be called instead of default_tabulate
    when custom_tabulate.py is found in the directory hierarchy.
    """
    print("Using custom tabulation logic")

    # Define custom success strings for this benchmark
    custom_success_strings = [
        "CUSTOM_COMPLETE",
        "BENCHMARK_PASSED"
    ]

    # Define custom timer regex
    custom_timer_regex = r"TIME_ELAPSED:\s*([\d.]+)"

    # Use default implementation with custom parameters
    def custom_scorer(instance_dir: str) -> Optional[bool]:
        return default_scorer(instance_dir, custom_success_strings)

    def custom_timer(instance_dir: str) -> Optional[float]:
        return default_timer(instance_dir, custom_timer_regex)

    default_tabulate(args, scorer=custom_scorer, timer=custom_timer)
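One way a search through the directory hierarchy for custom_tabulate.py could work is to walk upward from a starting directory until the file is found or the filesystem root is reached. The sketch below illustrates that idea only; find_custom_tabulate is a hypothetical helper, and the module's actual search order and stopping conditions are not specified on this page.

```python
import os
from typing import Optional


def find_custom_tabulate(
    start_dir: str, filename: str = "custom_tabulate.py"
) -> Optional[str]:
    """Walk from start_dir toward the root; return the first matching file path."""
    current = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(current, filename)
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:
            # dirname() of the root is the root itself: nothing left to search.
            return None
        current = parent
```

A caller would fall back to the default tabulation whenever this returns None, which mirrors the fallback behavior described for tabulate_cli().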
Direct Scorer and Timer Usage:
from agbench.tabulate_cmd import default_scorer, default_timer

# Check if an instance succeeded
instance_path = "/path/to/runlogs/task_123/0"
success = default_scorer(instance_path)
if success is True:
    print("Instance succeeded")
elif success is False:
    print("Instance completed but failed")
else:  # None
    print("Instance incomplete or not started")

# Get runtime for an instance
runtime = default_timer(instance_path)
if runtime is not None:
    print(f"Runtime: {runtime:.2f} seconds")
else:
    print("Runtime not available")
Advanced Analysis with Pandas:
import os

import pandas as pd

from agbench.tabulate_cmd import default_scorer, default_timer

# Build a custom DataFrame from the runlogs
runlogs_path = "/path/to/runlogs"
results = []
for task_id in os.listdir(runlogs_path):
    task_path = os.path.join(runlogs_path, task_id)
    if not os.path.isdir(task_path):
        continue
    for instance in range(10):  # Check the first 10 instances
        instance_dir = os.path.join(task_path, str(instance))
        if not os.path.isdir(instance_dir):
            break
        results.append({
            'task_id': task_id,
            'instance': instance,
            'success': default_scorer(instance_dir),
            'runtime': default_timer(instance_dir)
        })

df = pd.DataFrame(results)

# Custom analysis
# The success column holds True, False, or None, so count explicit True values;
# incomplete runs (None) are treated as not successful.
success_rate = df['success'].eq(True).mean()
avg_runtime = df['runtime'].mean(skipna=True)
print(f"Overall success rate: {success_rate:.2%}")
print(f"Average runtime: {avg_runtime:.2f}s")
Related Pages
- Agbench_CLI - Main CLI dispatcher that routes to this command
- Agbench_Remove_Missing_Cmd - Cleanup tool (typically used before tabulation)
- Statistical_Analysis - Computing success rates and aggregate metrics
- Data_Export - CSV and Excel output generation
- Pandas - DataFrame operations for result manipulation
- Tabulate - Table formatting for console output
- Plugin_System - Custom tabulation module loading mechanism