Implementation:ContextualAI HALOs Summarize Metrics

From Leeroopedia


Knowledge Sources
Domains NLP, Evaluation, Data_Engineering
Last Updated 2026-02-08 03:00 GMT

Overview

A concrete tool, provided by the summarize_metrics.py script, for extracting and aggregating evaluation metrics from log files.

Description

The evals/scripts/summarize_metrics.py script parses evaluation log files to extract metrics from both LM Eval Harness table-formatted output and AlpacaEval results. It uses regex patterns to match:

  • LM Eval Harness rows: Task name, n-shot, metric, value, stderr
  • AlpacaEval results: Model path, LCWR (length-controlled win rate), WR (win rate)

The script can process a single log file or an entire directory of .log files, producing both CSV and JSON output files with per-model metrics tables.
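The row matching described above can be sketched with a named-group regex. The pattern below is an illustrative assumption, not the script's actual pattern: the real column order and separators in an LM Eval Harness table may differ, and the sample line is made up.

```python
import re

# Hypothetical sketch of matching a pipe-delimited LM Eval Harness
# results row. Column order (task, version, filter, n-shot, metric,
# value, stderr) is an assumption for illustration; the real script's
# regex may differ.
ROW_RE = re.compile(
    r"\|\s*(?P<task>[\w-]+)\s*\|"      # task name
    r"[^|]*\|[^|]*\|"                  # version / filter columns (ignored)
    r"\s*(?P<nshot>\d+)\s*\|"          # n-shot count
    r"\s*(?P<metric>\w+)\s*\|"         # metric name, e.g. 'acc'
    r"\s*(?P<value>[\d.]+)\s*\|"       # metric value
    r"\s*±\s*\|\s*(?P<stderr>[\d.]+)"  # standard error
)

line = "|winogrande |0 |none |5 |acc |0.7522|± |0.0121|"
m = ROW_RE.search(line)
if m:
    print(m.group("task"), m.group("metric"), float(m.group("value")))
```

Each matched row then contributes one column (e.g. winogrande-acc) to that model's metrics record.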

Usage

Run as python evals/scripts/summarize_metrics.py /path/to/logs --output model_metrics after completing evaluations.

Code Reference

Source Location

  • Repository: ContextualAI/HALOs
  • File: evals/scripts/summarize_metrics.py
  • Lines: L8-41 (extract_alpaca_eval_metrics), L43-114 (extract_log_values), L116-131 (process_log_file), L133-150 (process_directory), L152-226 (main)

Signature

from typing import Dict, List, Optional

def extract_alpaca_eval_metrics(text: str) -> Dict:
    """Extract AlpacaEval LCWR and WR from log text.

    Returns:
        Dict mapping model_name -> {'alpacaeval_lcwr': float, 'alpacaeval_wr': float}
    """

def extract_log_values(text: str) -> Optional[Dict]:
    """Extract top-level metric values from LM Eval Harness log output.

    Returns:
        Dict with keys: 'model_name', per-task metrics (e.g., 'winogrande-acc'),
        optional 'alpacaeval_lcwr', 'alpacaeval_wr', 'avg'
        Returns None if no model name found.
    """

def process_log_file(file_path: str) -> Optional[Dict]:
    """Process a single log file and extract metrics."""

def process_directory(directory_path: str) -> List[Dict]:
    """Process all .log files in a directory."""

Import

# Run as CLI:
# python evals/scripts/summarize_metrics.py /path/to/logs --output model_metrics

# Or import functions:
from evals.scripts.summarize_metrics import extract_log_values, extract_alpaca_eval_metrics

I/O Contract

Inputs

  • path (str, required): Path to a .log file or directory of .log files
  • --output / -o (str, optional): Base name for output files (default: 'model_metrics')

Outputs

  • {output}.csv (file): CSV with columns: model_name, alpacaeval_lcwr, alpacaeval_wr, per-task metrics, avg
  • {output}.json (file): Same data in JSON format
  • Summary table (stdout): Printed table of models ranked by average score
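Downstream analysis can consume the JSON output directly. The snippet below is a minimal sketch assuming the JSON file holds a list of per-model metric dicts (an assumption inferred from the CSV column layout above); inline sample records stand in for a real comparison.json.

```python
import json

# Assumed shape of the JSON output: a list of per-model metric dicts,
# mirroring the CSV columns listed above. The records here are
# illustrative stand-ins; in practice you would load the file, e.g.:
#   with open("comparison.json") as f:
#       records = json.load(f)
records = [
    {"model_name": "llama3-8B-kto", "avg": 0.71, "mmlu-acc": 0.65},
    {"model_name": "llama3-8B-sft", "avg": 0.66, "mmlu-acc": 0.61},
]

# Rank models by average score, highest first, as the stdout summary does.
ranked = sorted(records, key=lambda r: r.get("avg", 0.0), reverse=True)
for r in ranked:
    print(f"{r['model_name']:20s} avg={r['avg']:.3f}")
```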

Usage Examples

Summarize a Directory of Logs

# Process all evaluation logs in a directory
python evals/scripts/summarize_metrics.py /logs/eval_results/ --output comparison

# Output:
# comparison.csv - Spreadsheet-compatible metrics table
# comparison.json - Programmatic access to all metrics

Summarize a Single Model

# Process a single model's evaluation log
python evals/scripts/summarize_metrics.py eval_llama3-8B-kto.log --output kto_metrics

Programmatic Usage

from evals.scripts.summarize_metrics import extract_log_values

with open('eval_log.log', 'r') as f:
    log_text = f.read()

# extract_log_values returns None if no model name is found in the log
metrics = extract_log_values(log_text)
if metrics is None:
    raise ValueError("No model name found in log")
print(f"Model: {metrics['model_name']}")
print(f"Average: {metrics.get('avg', 'N/A')}")
print(f"MMLU: {metrics.get('mmlu-acc', 'N/A')}")

Related Pages

Implements Principle

Requires Environment
