
Heuristic: EvolvingLMMs-Lab lmms-eval Bootstrap Iteration Optimization

From Leeroopedia
Knowledge Sources
Domains Statistics, Optimization
Last Updated 2026-02-14 00:00 GMT

Overview

Reduce bootstrap iterations from 100,000 to 100 for computationally expensive NLP metrics (BLEU, CHRF, and TER) to avoid prohibitively long evaluation times.

Description

The lmms-eval framework uses bootstrap resampling to compute standard error estimates for evaluation metrics. The default is 100,000 iterations, which provides statistically robust estimates. However, for certain NLP metrics — BLEU, CHRF, and TER — which involve expensive n-gram matching or edit-distance computations, running 100,000 iterations is computationally prohibitive. The framework automatically caps these metrics at 100 bootstrap iterations, which still yields reasonable error estimates at a fraction of the computation time.
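To make the mechanism concrete, here is a minimal sketch of bootstrap standard-error estimation. This is an illustrative implementation, not the framework's actual code: `bootstrap_stderr` and `metric_fn` are hypothetical names, and lmms-eval's real version handles corpus-level metrics and parallelism.

```python
import random
import statistics

def bootstrap_stderr(metric_fn, items, iters):
    """Estimate the standard error of an aggregate metric by
    recomputing it on `iters` resampled copies of the items."""
    n = len(items)
    estimates = [
        metric_fn([random.choice(items) for _ in range(n)])
        for _ in range(iters)
    ]
    # The spread of the bootstrap estimates approximates the
    # standard error of the aggregate metric.
    return statistics.stdev(estimates)

# Example: standard error of the mean over per-example scores.
scores = [0.2, 0.5, 0.9, 0.4, 0.7, 0.6]
se = bootstrap_stderr(statistics.mean, scores, iters=100)
```

Each iteration calls `metric_fn` on a full resample, which is why the cost scales linearly with the iteration count: cheap metrics like accuracy tolerate 100,000 iterations, while corpus-level metrics like BLEU do not.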

Usage

This heuristic is automatically applied when computing standard errors for BLEU, CHRF, or TER metrics. Users can control the overall bootstrap iteration count via the bootstrap_iters parameter (default: 100,000), but the cap at 100 for these specific metrics is hardcoded.

The Insight (Rule of Thumb)

  • Action: Cap bootstrap iterations at 100 for BLEU, CHRF, and TER metrics.
  • Value: min(bootstrap_iters, 100) for these three metrics; full bootstrap_iters (default 100,000) for all others.
  • Trade-off: Less precise standard error estimates for translation metrics, but prevents evaluation from taking hours/days on stderr computation alone.
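The rule above amounts to a one-line cap, sketched here as a standalone helper (the function name `effective_bootstrap_iters` is hypothetical; the actual logic appears inline in the framework, as shown in the code evidence below):

```python
# Expensive corpus-level metrics are capped; everything else
# runs the full requested iteration count.
EXPENSIVE_METRICS = {"bleu", "chrf", "ter"}

def effective_bootstrap_iters(metric: str, bootstrap_iters: int = 100000) -> int:
    if metric in EXPENSIVE_METRICS:
        return min(bootstrap_iters, 100)
    return bootstrap_iters
```

Note that `min` means a user-supplied value below 100 is still respected; the cap only bounds the iteration count from above.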

Reasoning

BLEU, CHRF, and TER all involve complex n-gram matching, alignment scoring, and/or edit distance computations. Each bootstrap iteration requires re-computing the full metric on a resampled dataset. At 100,000 iterations, this becomes the dominant bottleneck in the evaluation pipeline, far exceeding the time spent on actual model inference. Empirically, 100 iterations provide standard error estimates that are within acceptable precision for practical use.

The framework also computes CLT-based (Central Limit Theorem) standard errors as an alternative, which require no bootstrap iterations at all. These provide a faster but less robust estimate.

Code evidence from lmms_eval/evaluator_utils.py:128:

bootstrap_iters=min(bootstrap_iters, 100) if metric in ["bleu", "chrf", "ter"] else bootstrap_iters,

Default bootstrap iteration count from lmms_eval/evaluator.py:67:

bootstrap_iters: int = 100000,

CLT-based alternative from lmms_eval/evaluator_utils.py:155-156:

# Naive CLT stderr: std / sqrt(n)
self.agg_metrics[f"{metric}_stderr_clt,{filter_key}"] = np.std(numeric_items, ddof=1) / np.sqrt(n) if n > 1 else "N/A"
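For comparison, the CLT computation above can be written as a self-contained function (the name `clt_stderr` is illustrative; the framework computes this inline as shown):

```python
import numpy as np

def clt_stderr(numeric_items):
    """Naive CLT standard error: sample std divided by sqrt(n).
    Requires at least two items; otherwise no spread is defined."""
    n = len(numeric_items)
    if n <= 1:
        return "N/A"
    return np.std(numeric_items, ddof=1) / np.sqrt(n)
```

This costs a single pass over the data, so it is effectively free compared to bootstrap resampling, but it assumes the aggregate metric is a simple mean of per-example scores, which does not hold for corpus-level metrics like BLEU.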
