Implementation:EvolvingLMMs Lab Lmms eval logging utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Logging, Experiment Tracking, Weights & Biases
Last Updated	2026-02-14 00:00 GMT

Overview

Utilities for logging evaluation results and samples to Weights & Biases for experiment tracking and visualization.

Description

This module provides the WandbLogger class and supporting utilities for integrating lmms-eval with Weights & Biases (W&B). It handles initialization of W&B runs, logs evaluation results as tables and artifacts, processes and sanitizes metrics dictionaries, and uploads evaluation samples as dataframes. The code includes retry logic for robustness, handles non-serializable objects, and supports both individual task results and grouped task results.

Usage

Use this module when you want to track evaluation experiments in W&B, compare results across multiple runs, visualize metrics in the W&B dashboard, or archive evaluation results and samples as versioned artifacts. Initialize WandbLogger with command-line args, call post_init() after evaluation, then use log_eval_result() and log_eval_samples() to upload data.

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/logging_utils.py
Lines: 1-366

Signature

class WandbLogger:
    def __init__(self, args)

    def finish(self)

    def init_run(self)

    def post_init(self, results: Dict[str, Any]) -> None

    def log_eval_result(self) -> None

    def log_eval_samples(self, samples: Dict[str, List[Dict[str, Any]]]) -> None

def remove_none_pattern(input_string: str) -> Tuple[str, bool]

def _handle_non_serializable(o: Any) -> Union[int, str, list]

def get_wandb_printer() -> Literal["Printer"]

Import

from lmms_eval.logging_utils import WandbLogger

I/O Contract

Inputs

Name	Type	Required	Description
args	argparse.Namespace	Yes	Command-line arguments containing wandb_args and evaluation config
results	Dict[str, Any]	Yes	Evaluation results dictionary with 'results', 'configs', 'groups' keys
samples	Dict[str, List[Dict]]	Yes	Per-task evaluation samples with predictions and metrics

Outputs

Name	Type	Description
W&B run	wandb.Run	Initialized W&B run object
Logged metrics	wandb logs	Evaluation metrics logged to W&B dashboard
Tables	wandb.Table	Evaluation results formatted as interactive tables
Artifacts	wandb.Artifact	JSON files with complete results and samples

Usage Examples

Basic Usage

from lmms_eval.logging_utils import WandbLogger

# Initialize logger with command-line args
logger = WandbLogger(args)

# After evaluation completes
logger.post_init(results)

# Log results to W&B
logger.log_eval_result()

# Log individual samples
logger.log_eval_samples(samples)

# Finish the run
logger.finish()

Handling Non-Serializable Objects

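import json

import numpy as np

from lmms_eval.logging_utils import _handle_non_serializable


class CustomObject:
    """Placeholder for any object that json cannot serialize natively."""


results = {
    "metric": np.int64(42),  # NumPy integer, not a built-in int
    "tasks": {"task1", "task2"},  # sets are not JSON-serializable
    "complex_obj": CustomObject(),
}

# json.dumps calls the handler for every value it cannot serialize itself
json_str = json.dumps(results, default=_handle_non_serializable)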
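The return annotation Union[int, str, list] suggests how such a handler can behave. The following is a stdlib-only sketch under that assumption, not the module's actual code: `fallback_serializer` and `Opaque` are hypothetical names, and the integer check relies on NumPy integer scalars being registered as `numbers.Integral`.

```python
import json
import numbers
from typing import Any, Union


def fallback_serializer(o: Any) -> Union[int, str, list]:
    """Fallback for json.dumps: integer-likes to int, sets to list, the rest to str."""
    if isinstance(o, numbers.Integral):
        return int(o)  # covers integer scalars registered as Integral
    if isinstance(o, set):
        return list(o)  # element order of the resulting list is not guaranteed
    return str(o)  # last resort: keep a readable representation


class Opaque:
    """Hypothetical object with no native JSON representation."""

    def __str__(self) -> str:
        return "<Opaque>"


encoded = json.dumps({"tasks": {"task1"}, "obj": Opaque()}, default=fallback_serializer)
```

Falling back to str() means serialization never raises, at the cost of losing structure for unknown types.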

Custom W&B Configuration

# Via command-line args
args.wandb_args = "project=my-project,name=my-run,tags=tag1;tag2"

# Or via environment variables
import os
os.environ["WANDB_PROJECT"] = "my-project"
os.environ["WANDB_MODE"] = "offline"  # For offline mode

logger = WandbLogger(args)
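The wandb_args value is a comma-separated list of key=value pairs. As an illustration only, a string of that shape could be turned into keyword arguments as sketched below; `parse_kv_string` is a hypothetical helper, and the parsing the module actually performs may differ (for example in how it converts value types).

```python
from typing import Dict


def parse_kv_string(arg_string: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict; values are kept as plain strings."""
    # Skip empty items so a trailing comma does not break parsing.
    pairs = (item.split("=", 1) for item in arg_string.split(",") if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}


wandb_kwargs = parse_kv_string("project=my-project,name=my-run")
```

Splitting on the first "=" only lets values themselves contain "=" characters.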

Implementation Details

Metric Sanitization

The logger removes ",none" suffixes from metric names and separates string-valued metrics into wandb.run.summary to ensure numeric metrics can be properly plotted.
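A rough illustration of this sanitization step, assuming a flat metrics dictionary: the sketch strips a trailing ",none" from each key and separates string values from the rest. `strip_none_suffix` and `split_metrics` are hypothetical names chosen for this example, and the input values are made up.

```python
import re
from typing import Any, Dict, Tuple


def strip_none_suffix(name: str) -> Tuple[str, bool]:
    """Remove a trailing ',none' and report whether anything was removed."""
    cleaned = re.sub(r",none$", "", name)
    return cleaned, cleaned != name


def split_metrics(metrics: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return (non-string metrics, string-valued metrics) with cleaned keys."""
    numeric: Dict[str, Any] = {}
    textual: Dict[str, str] = {}
    for key, value in metrics.items():
        clean_key, _ = strip_none_suffix(key)
        if isinstance(value, str):
            textual[clean_key] = value  # candidates for the run summary
        else:
            numeric[clean_key] = value  # safe to log and plot
    return numeric, textual


numeric, textual = split_metrics({"acc,none": 0.81, "acc_stderr,none": "N/A"})
```

In the logger, the string-valued part corresponds to what is written to wandb.run.summary, while the remaining metrics are logged normally.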

Result Organization

Metrics are restructured from nested dictionaries to flat keys like "task_name/metric_name" for better W&B visualization.
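A minimal sketch of that restructuring, assuming results are nested as {task: {metric: value}}. `flatten_results` is a hypothetical helper and the task names and values are placeholders, not real evaluation output.

```python
from typing import Any, Dict


def flatten_results(results: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Turn {task: {metric: value}} into {'task/metric': value}."""
    flat: Dict[str, Any] = {}
    for task_name, metrics in results.items():
        for metric_name, value in metrics.items():
            flat[f"{task_name}/{metric_name}"] = value
    return flat


flat = flatten_results({"task_a": {"acc": 0.5}, "task_b": {"exact_match": 0.25}})
```

Keys containing "/" are grouped into sections in the W&B UI, which is why this shape is convenient for dashboards.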

Retry Logic

The init_run() method is wrapped in a tenacity retry (up to 5 attempts, 5 seconds apart) so that transient network issues during run initialization do not abort the evaluation.
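Reading the policy as up to 5 attempts spaced 5 seconds apart, the behavior can be approximated with the standard library alone. This is a sketch for illustration: `retry_fixed` and `flaky_init` are hypothetical names, and the module relies on tenacity rather than a helper like this.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_fixed(fn: Callable[[], T], attempts: int = 5, wait_seconds: float = 5.0) -> T:
    """Call fn until it succeeds, pausing wait_seconds between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(wait_seconds)
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky_init() -> str:
    """Stand-in for a network call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "run-started"


# wait_seconds=0.0 keeps the demonstration instant.
outcome = retry_fixed(flaky_init, attempts=5, wait_seconds=0.0)
```

Bounding the number of attempts keeps a persistent outage from stalling the evaluation indefinitely.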

Artifact Structure

results: Complete evaluation results as JSON
samples_by_task: Individual task samples as JSON files
tables: Interactive tables for exploration in W&B UI
