Heuristic: Ragas (explodinggradients) Failed Metrics Return NaN
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Debugging |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Ragas returns NaN for failed metric evaluations instead of crashing, ensuring partial results are always available from large evaluation runs.
Description
The Ragas `Executor` wraps each metric evaluation in a try/except block. By default (`raise_exceptions=False`), any exception during metric scoring results in `np.nan` being returned for that sample, while all other samples continue processing. The exception is logged as an error but does not halt the evaluation. This graceful degradation pattern means a 100-row evaluation with 3 failing rows still produces 97 valid scores. The NVIDIA collection metrics use a similar pattern: when all retry attempts are exhausted without a valid rating, they return `float("nan")`.
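The pattern can be reproduced in a minimal sketch (illustrative only, not the actual Ragas `Executor` API): a wrapper that returns `float("nan")` on any exception unless strict mode is requested.

```python
import math

def run_metric(score_fn, sample, raise_exceptions=False):
    """Sketch of the graceful-degradation pattern: return NaN on
    failure so other samples keep processing (hypothetical helper,
    not part of Ragas)."""
    try:
        return score_fn(sample)
    except Exception:
        if raise_exceptions:
            raise
        return float("nan")

def flaky_metric(sample):
    # Simulates an LLM call that fails for one particular sample.
    if sample == "bad":
        raise ValueError("LLM call failed")
    return 0.9

scores = [run_metric(flaky_metric, s) for s in ["ok", "bad", "ok"]]
# Two valid scores survive; the failed sample becomes NaN.
```

With `raise_exceptions=True`, the same wrapper re-raises instead, matching the strict mode described above.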
Usage
Apply this heuristic when interpreting evaluation results with missing values or debugging why some samples have NaN scores. Check the logs for `Exception raised in Job[N]` messages to identify the root cause. Set `raise_exceptions=True` on the Executor if you need strict failure-on-error behavior (e.g., in CI pipelines where partial results are unacceptable).
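To locate the failed samples, filter the results for NaN values. A minimal sketch, assuming a results DataFrame with a `faithfulness` score column (in Ragas you would typically obtain such a frame from the evaluation result):

```python
import pandas as pd

# Hypothetical results frame with one failed sample.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "faithfulness": [0.8, float("nan"), 0.95],
})

# NaN rows mark samples whose metric evaluation failed.
failed = df[df["faithfulness"].isna()]
```

The `question` values of `failed` can then be cross-referenced with the `Exception raised in Job[N]` log lines to find the root cause.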
The Insight (Rule of Thumb)
- Action: Check for NaN values in evaluation results and inspect logs for error messages.
- Value: Default behavior returns `np.nan` for failed metrics; set `raise_exceptions=True` for strict mode.
- Trade-off: Graceful degradation = always get partial results but may silently hide systemic issues. Strict mode = immediately fails on first error but stops all remaining evaluations.
- NaN detection trick: NVIDIA metrics use `score == score` (False for NaN, True for valid floats) as a concise NaN check instead of `math.isnan()`.
- Aggregation: `safe_nanmean()` in `src/ragas/utils.py` computes mean ignoring NaN values, so aggregate scores are based only on successful evaluations.
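The self-equality trick and NaN-aware aggregation from the bullets above can be demonstrated in a few lines of plain Python (illustrative, independent of Ragas):

```python
def is_valid(score: float) -> bool:
    # NaN is the only float value not equal to itself, so this
    # returns True exactly for valid scores.
    return score == score

scores = [0.7, float("nan"), 0.9]

# Aggregate over successful evaluations only, mirroring the
# safe_nanmean behavior described above.
valid = [s for s in scores if is_valid(s)]
mean = sum(valid) / len(valid) if valid else float("nan")
```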
Reasoning
LLM API calls are inherently unreliable: they can fail due to rate limits, network issues, malformed responses, or provider outages. In a large evaluation run (hundreds or thousands of samples), strict failure would waste all completed work. The NaN pattern allows batch evaluations to complete even when a small percentage of calls fail, while preserving the information about which samples failed for debugging.
The `safe_nanmean()` function ensures that aggregate scores correctly ignore NaN values, so a few failures do not corrupt the overall evaluation metrics.
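The difference matters because a plain mean is poisoned by a single NaN, while a NaN-aware mean averages only the successful scores. A quick NumPy illustration:

```python
import numpy as np

scores = np.array([0.8, np.nan, 0.9, 0.7])

naive = np.mean(scores)      # NaN: one failure corrupts the aggregate
robust = np.nanmean(scores)  # averages only the three successful scores
```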
Code Evidence
Executor NaN fallback from `src/ragas/executor.py:67-84`:
```python
async def wrapped_callable_async(*args, **kwargs) -> t.Tuple[int, t.Any]:
    try:
        result = await callable(*args, **kwargs)
        return counter, result
    except Exception as e:
        if self.raise_exceptions:
            raise e
        else:
            exec_name = type(e).__name__
            exec_message = str(e)
            logger.error(
                "Exception raised in Job[%s]: %s(%s)",
                counter,
                exec_name,
                exec_message,
                exc_info=False,
            )
            return counter, np.nan
```
NVIDIA metrics NaN-self-equality check from `src/ragas/metrics/_nv_metrics.py:109-130`:
```python
for retry in range(self.retry):
    formatted_prompt = ...
    resp = await ...
    score = self.process_score(resp.generations[0][0].text)
    if score == score:  # NaN check (NaN != NaN)
        break
    else:
        logger.warning(f"Retry: {retry}")
```
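The retry-until-valid shape of that loop can be sketched as a standalone helper (hypothetical, not the NVIDIA metric code): keep calling until a non-NaN score arrives, and fall through to NaN when retries are exhausted.

```python
import logging

logger = logging.getLogger(__name__)

def score_with_retries(call, retries=3):
    """Retry-until-valid sketch: return the first non-NaN score,
    or NaN if every attempt fails (illustrative helper)."""
    score = float("nan")
    for attempt in range(retries):
        score = call()
        if score == score:  # valid (non-NaN) score
            break
        logger.warning("Retry: %s", attempt)
    return score

# Third attempt succeeds, so 0.85 is returned.
attempts = iter([float("nan"), float("nan"), 0.85])
result = score_with_retries(lambda: next(attempts))
```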
safe_nanmean from `src/ragas/utils.py:46-48`:
```python
def safe_nanmean(arr: t.List[float]) -> float:
    if len(arr) == 0:
        return np.nan
    # ...remainder (not shown) averages the non-NaN entries
```