Implementation:EvolvingLMMs Lab Lmms eval logging utils
| Knowledge Sources | |
|---|---|
| Domains | Logging, Experiment Tracking, Weights & Biases |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Utilities for logging evaluation results and samples to Weights & Biases for experiment tracking and visualization.
Description
This module provides the WandbLogger class and supporting utilities for integrating lmms-eval with Weights & Biases (W&B). It handles initialization of W&B runs, logs evaluation results as tables and artifacts, processes and sanitizes metrics dictionaries, and uploads evaluation samples as dataframes. The code includes retry logic for robustness, handles non-serializable objects, and supports both individual task results and grouped task results.
Usage
Use this module when you want to track evaluation experiments in W&B, compare results across multiple runs, visualize metrics in the W&B dashboard, or archive evaluation results and samples as versioned artifacts. Initialize WandbLogger with command-line args, call post_init() after evaluation, then use log_eval_result() and log_eval_samples() to upload data.
Code Reference
Source Location
- Repository: EvolvingLMMs_Lab_Lmms_eval
- File: lmms_eval/logging_utils.py
- Lines: 1-366
Signature
class WandbLogger:
    def __init__(self, args)
    def finish(self)
    def init_run(self)
    def post_init(self, results: Dict[str, Any]) -> None
    def log_eval_result(self) -> None
    def log_eval_samples(self, samples: Dict[str, List[Dict[str, Any]]]) -> None

def remove_none_pattern(input_string: str) -> Tuple[str, bool]
def _handle_non_serializable(o: Any) -> Union[int, str, list]
def get_wandb_printer() -> Literal["Printer"]
Import
from lmms_eval.logging_utils import WandbLogger
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | argparse.Namespace | Yes | Command-line arguments containing wandb_args and evaluation config |
| results | Dict[str, Any] | Yes | Evaluation results dictionary with 'results', 'configs', 'groups' keys |
| samples | Dict[str, List[Dict]] | Yes | Per-task evaluation samples with predictions and metrics |
Outputs
| Name | Type | Description |
|---|---|---|
| W&B run | wandb.Run | Initialized W&B run object |
| Logged metrics | wandb logs | Evaluation metrics logged to W&B dashboard |
| Tables | wandb.Table | Evaluation results formatted as interactive tables |
| Artifacts | wandb.Artifact | JSON files with complete results and samples |
Usage Examples
Basic Usage
from lmms_eval.logging_utils import WandbLogger
# Initialize the logger; `args` is the parsed argparse.Namespace (see Inputs)
logger = WandbLogger(args)
# After evaluation completes, hand over the results dictionary
logger.post_init(results)
# Log results to W&B
logger.log_eval_result()
# Log individual samples
logger.log_eval_samples(samples)
# Finish the run
logger.finish()
Handling Non-Serializable Objects
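import json

import numpy as np
from lmms_eval.logging_utils import _handle_non_serializable

class CustomObject:
    """Placeholder for any object that json cannot serialize natively."""

results = {
    "metric": np.int64(42),          # NumPy integer
    "tasks": {"task1", "task2"},     # set
    "complex_obj": CustomObject(),   # arbitrary object
}
# Safely serialize to JSON; values json cannot encode are passed to the handler
json_str = json.dumps(results, default=_handle_non_serializable)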
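For readers without the library at hand, a `default=` hook of this kind can be sketched with the standard library alone. The name `to_jsonable` is hypothetical, and the behaviour is inferred from the `Union[int, str, list]` return type in the signature, not copied from `_handle_non_serializable`.

```python
import json
import numbers
from typing import Any, Union


def to_jsonable(o: Any) -> Union[int, str, list]:
    """Hypothetical json.dumps `default=` hook (not the library's code)."""
    # Integer-like objects that json rejects (NumPy integer scalars register
    # themselves as numbers.Integral) become plain Python ints.
    if isinstance(o, numbers.Integral):
        return int(o)
    # Sets have no JSON equivalent; emit them as sorted lists for stable output.
    if isinstance(o, (set, frozenset)):
        return sorted(o, key=str)
    # Anything else falls back to its string representation.
    return str(o)


class Opaque:
    """Illustrative object with no natural JSON form."""

    def __repr__(self) -> str:
        return "Opaque()"


payload = {"tasks": {"task2", "task1"}, "obj": Opaque()}
encoded = json.dumps(payload, default=to_jsonable)
```

`json.dumps` only calls the hook for values it cannot encode itself, so ordinary ints, floats, strings, lists, and dicts never reach it.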
Custom W&B Configuration
import os

from lmms_eval.logging_utils import WandbLogger

# Via command-line args (comma-separated key=value pairs)
args.wandb_args = "project=my-project,name=my-run,tags=tag1;tag2"
# Or via environment variables
os.environ["WANDB_PROJECT"] = "my-project"
os.environ["WANDB_MODE"] = "offline"  # For offline mode
logger = WandbLogger(args)
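The `wandb_args` value is a single string of comma-separated `key=value` pairs. As a rough illustration of how such a string could be turned into keyword arguments, here is a minimal sketch; `parse_kv_string` is a hypothetical helper, and the parsing lmms-eval actually performs may differ (for example in how it converts value types).

```python
from typing import Dict


def parse_kv_string(arg_string: str) -> Dict[str, str]:
    """Hypothetical parser: 'a=1,b=2' -> {'a': '1', 'b': '2'}."""
    pairs: Dict[str, str] = {}
    for item in arg_string.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input and trailing commas
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        pairs[key.strip()] = value.strip()
    return pairs


kwargs = parse_kv_string("project=my-project,name=my-run,tags=tag1;tag2")
```

In this sketch every value stays a string, so `tags` is still `"tag1;tag2"`; splitting it on `;` would be a separate step.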
Implementation Details
Metric Sanitization
The logger removes ",none" suffixes from metric names and separates string-valued metrics into wandb.run.summary to ensure numeric metrics can be properly plotted.
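The sanitization step can be pictured with a small stand-alone sketch. The helpers `strip_none_suffix` and `split_metrics` are hypothetical names chosen for illustration (the first mirrors the `Tuple[str, bool]` shape of `remove_none_pattern`), and the sample metric values are made up.

```python
from typing import Any, Dict, Tuple


def strip_none_suffix(name: str) -> Tuple[str, bool]:
    """Remove a trailing ',none' and report whether it was present."""
    suffix = ",none"
    if name.endswith(suffix):
        return name[: -len(suffix)], True
    return name, False


def split_metrics(metrics: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return (plottable metrics, string-valued metrics) with cleaned names."""
    plottable: Dict[str, Any] = {}
    summary: Dict[str, str] = {}
    for name, value in metrics.items():
        clean_name, _ = strip_none_suffix(name)
        if isinstance(value, str):
            summary[clean_name] = value  # would go to the run summary
        else:
            plottable[clean_name] = value  # safe to log and chart
    return plottable, summary


plottable, summary = split_metrics(
    {"acc,none": 0.71, "acc_stderr,none": "N/A", "alias": "task_a"}
)
```

Keeping strings out of the logged metrics means each charted key holds only numeric values.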
Result Organization
Metrics are restructured from nested dictionaries to flat keys like "task_name/metric_name" for better W&B visualization.
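A minimal sketch of that flattening, assuming the nested input maps task names to metric dictionaries; `flatten_results` and the task and metric names are illustrative, not taken from the module.

```python
from typing import Any, Dict


def flatten_results(results: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Turn {task: {metric: value}} into {'task/metric': value}."""
    flat: Dict[str, Any] = {}
    for task_name, metrics in results.items():
        for metric_name, value in metrics.items():
            flat[f"{task_name}/{metric_name}"] = value
    return flat


flat = flatten_results(
    {"task_a": {"acc": 0.71}, "task_b": {"exact_match": 0.64, "acc": 0.6}}
)
```

W&B groups keys that share a prefix before the slash into the same dashboard section, which is why this shape charts well.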
Retry Logic
The init_run() method is wrapped in a tenacity retry: up to 5 attempts with a 5-second wait between them, which lets run creation ride out transient network issues.
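The retry behaviour (five attempts, five seconds apart) can be illustrated without tenacity by a hand-rolled decorator. This is a standard-library stand-in, not the tenacity decorator the module uses; `retry_fixed` and `flaky_init` are hypothetical names.

```python
import functools
import time
from typing import Any, Callable


def retry_fixed(attempts: int = 5, wait_seconds: float = 5.0) -> Callable:
    """Retry the wrapped callable up to `attempts` times, sleeping between tries."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait_seconds)

        return wrapper

    return decorator


calls = {"count": 0}


@retry_fixed(attempts=5, wait_seconds=0.0)  # zero wait keeps the demo instant
def flaky_init() -> str:
    """Stand-in for a network call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "run-started"


outcome = flaky_init()
```

If every attempt fails, the last exception propagates to the caller, so a persistent outage still surfaces as an error.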
Artifact Structure
- results: Complete evaluation results as JSON
- samples_by_task: Individual task samples as JSON files
- tables: Interactive tables for exploration in W&B UI