Implementation: ContextualAI HALOs Summarize Metrics
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Data_Engineering |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A concrete tool, provided by the summarize_metrics.py script, for extracting and aggregating evaluation metrics from log files.
Description
The evals/scripts/summarize_metrics.py script parses evaluation log files to extract metrics from both LM Eval Harness table-formatted output and AlpacaEval results. It uses regex patterns to match:
- LM Eval Harness rows: task name, n-shot, metric, value, stderr
- AlpacaEval results: model path, length-controlled win rate (LCWR), win rate (WR)
The script can process a single log file or an entire directory of .log files, producing both CSV and JSON output files with per-model metrics tables.
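As an illustration of the row-matching approach, here is a minimal sketch assuming a pipe-delimited layout of the form `|task | n-shot | metric | value | ± | stderr |`; the actual patterns and column order in summarize_metrics.py may differ.

```python
import re
from typing import Dict

# Assumed row shape (illustrative, not the script's actual pattern):
# |winogrande |     5 |acc    |0.7522 |±  |0.0121 |
LM_EVAL_ROW = re.compile(
    r"\|\s*(?P<task>[\w\-]+)\s*"
    r"\|\s*(?P<nshot>\d+)\s*"
    r"\|\s*(?P<metric>\w+)\s*"
    r"\|\s*(?P<value>\d+\.\d+)\s*"
    r"\|\s*±\s*"
    r"\|\s*(?P<stderr>\d+\.\d+)\s*\|"
)

def collect_task_metrics(text: str) -> Dict[str, float]:
    """Map 'task-metric' keys (e.g., 'winogrande-acc') to metric values."""
    return {
        f"{m.group('task')}-{m.group('metric')}": float(m.group("value"))
        for m in LM_EVAL_ROW.finditer(text)
    }
```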
Usage
Run `python evals/scripts/summarize_metrics.py /path/to/logs --output model_metrics` after completing evaluations.
Code Reference
Source Location
- Repository: ContextualAI/HALOs
- File: evals/scripts/summarize_metrics.py
- Lines: L8-41 (extract_alpaca_eval_metrics), L43-114 (extract_log_values), L116-131 (process_log_file), L133-150 (process_directory), L152-226 (main)
Signature
```python
def extract_alpaca_eval_metrics(text: str) -> Dict:
    """Extract AlpacaEval LCWR and WR from log text.

    Returns:
        Dict mapping model_name -> {'alpacaeval_lcwr': float, 'alpacaeval_wr': float}
    """

def extract_log_values(text: str) -> Optional[Dict]:
    """Extract top-level metric values from LM Eval Harness log output.

    Returns:
        Dict with keys: 'model_name', per-task metrics (e.g., 'winogrande-acc'),
        optional 'alpacaeval_lcwr', 'alpacaeval_wr', 'avg'.
        Returns None if no model name is found.
    """

def process_log_file(file_path: str) -> Optional[Dict]:
    """Process a single log file and extract metrics."""

def process_directory(directory_path: str) -> List[Dict]:
    """Process all .log files in a directory."""
```
Import
```python
# Run as CLI:
#   python evals/scripts/summarize_metrics.py /path/to/logs --output model_metrics
# Or import functions:
from evals.scripts.summarize_metrics import extract_log_values, extract_alpaca_eval_metrics
```
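For instance, the AlpacaEval extractor can be applied to raw log text directly; per the documented signature it returns a dict keyed by model name (the log file name below is illustrative):

```python
from evals.scripts.summarize_metrics import extract_alpaca_eval_metrics

# Read a completed evaluation log (illustrative file name)
with open('eval_log.log') as f:
    alpaca = extract_alpaca_eval_metrics(f.read())

# Dict mapping model_name -> {'alpacaeval_lcwr': float, 'alpacaeval_wr': float}
for model, scores in alpaca.items():
    print(model, scores['alpacaeval_lcwr'], scores['alpacaeval_wr'])
```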
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to a .log file or directory of .log files |
| --output / -o | str | No | Base name for output files (default: 'model_metrics') |
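A minimal argparse sketch consistent with this input contract; the script's actual option handling and help strings may differ:

```python
import argparse

# Sketch of a parser matching the documented inputs; details are assumptions.
parser = argparse.ArgumentParser(
    description="Summarize evaluation metrics from log files."
)
parser.add_argument("path", help="Path to a .log file or directory of .log files")
parser.add_argument(
    "--output", "-o", default="model_metrics",
    help="Base name for output files (default: 'model_metrics')",
)
args = parser.parse_args()
```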
Outputs
| Name | Type | Description |
|---|---|---|
| {output}.csv | File | CSV with columns: model_name, alpacaeval_lcwr, alpacaeval_wr, per-task metrics, avg |
| {output}.json | File | Same data in JSON format |
| Summary table | stdout | Printed table of models ranked by average score |
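For concreteness, a hedged sketch of how such per-model rows could be written to both formats with the standard library; the helper name and field ordering are illustrative, not the script's actual code:

```python
import csv
import json
from typing import Dict, List

def write_outputs(rows: List[Dict], base_name: str = "model_metrics") -> None:
    """Write per-model metric dicts to {base_name}.csv and {base_name}.json."""
    # Union of keys across all rows, with model_name first for readability.
    fieldnames = sorted(
        {key for row in rows for key in row},
        key=lambda k: (k != "model_name", k),
    )
    with open(f"{base_name}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    with open(f"{base_name}.json", "w") as f:
        json.dump(rows, f, indent=2)
```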
Usage Examples
Summarize a Directory of Logs
```bash
# Process all evaluation logs in a directory
python evals/scripts/summarize_metrics.py /logs/eval_results/ --output comparison

# Output:
#   comparison.csv  - spreadsheet-compatible metrics table
#   comparison.json - programmatic access to all metrics
```
Summarize a Single Model
```bash
# Process a single model's evaluation log
python evals/scripts/summarize_metrics.py eval_llama3-8B-kto.log --output kto_metrics
```
Programmatic Usage
```python
from evals.scripts.summarize_metrics import extract_log_values

with open('eval_log.log', 'r') as f:
    log_text = f.read()

# extract_log_values returns None when no model name is found in the log
metrics = extract_log_values(log_text)
if metrics is not None:
    print(f"Model: {metrics['model_name']}")
    print(f"Average: {metrics.get('avg', 'N/A')}")
    print(f"MMLU: {metrics.get('mmlu-acc', 'N/A')}")
```