Heuristic: Ragas (explodinggradients) Failed Metrics Return NaN
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Debugging |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Ragas returns NaN for failed metric evaluations instead of crashing, ensuring partial results are always available from large evaluation runs.
Description
The Ragas `Executor` wraps each metric evaluation in a try/except block. By default (`raise_exceptions=False`), any exception during metric scoring results in `np.nan` being returned for that sample, while all other samples continue processing. The exception is logged as an error but does not halt the evaluation. This graceful degradation pattern means a 100-row evaluation with 3 failing rows still produces 97 valid scores. The NVIDIA collection metrics use a similar pattern: when all retry attempts are exhausted without a valid rating, they return `float("nan")`.
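The pattern can be reproduced in a minimal sketch (illustrative only, not the actual Ragas `Executor` API): a wrapper that returns `float("nan")` on any exception unless strict mode is requested.

```python
import math

def run_metric(score_fn, sample, raise_exceptions=False):
    """Sketch of the graceful-degradation pattern: return NaN on
    failure so other samples keep processing (hypothetical helper,
    not part of Ragas)."""
    try:
        return score_fn(sample)
    except Exception:
        if raise_exceptions:
            raise
        return float("nan")

def flaky_metric(sample):
    # Simulates an LLM call that fails for one particular sample.
    if sample == "bad":
        raise ValueError("LLM call failed")
    return 0.9

scores = [run_metric(flaky_metric, s) for s in ["ok", "bad", "ok"]]
# Two valid scores survive; the failed sample becomes NaN.
```

With `raise_exceptions=True`, the same wrapper re-raises instead, matching the strict mode described above.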
Usage
Apply this heuristic when interpreting evaluation results with missing values or debugging why some samples have NaN scores. Check the logs for `Exception raised in Job[N]` messages to identify the root cause. Set `raise_exceptions=True` on the Executor if you need strict failure-on-error behavior (e.g., in CI pipelines where partial results are unacceptable).
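To locate the failed samples, filter the results for NaN values. A minimal sketch, assuming a results DataFrame with a `faithfulness` score column (in Ragas you would typically obtain such a frame from the evaluation result):

```python
import pandas as pd

# Hypothetical results frame with one failed sample.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "faithfulness": [0.8, float("nan"), 0.95],
})

# NaN rows mark samples whose metric evaluation failed.
failed = df[df["faithfulness"].isna()]
```

The `question` values of `failed` can then be cross-referenced with the `Exception raised in Job[N]` log lines to find the root cause.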
The Insight (Rule of Thumb)
- Action: Check for NaN values in evaluation results and inspect logs for error messages.
- Value: Default behavior returns `np.nan` for failed metrics; set `raise_exceptions=True` for strict mode.
- Trade-off: Graceful degradation = always get partial results but may silently hide systemic issues. Strict mode = immediately fails on first error but stops all remaining evaluations.
- NaN detection trick: NVIDIA metrics use `score == score` (False for NaN, True for valid floats) as a concise NaN check instead of `math.isnan()`.
- Aggregation: `safe_nanmean()` in `src/ragas/utils.py` computes mean ignoring NaN values, so aggregate scores are based only on successful evaluations.
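The self-equality trick and NaN-aware aggregation from the bullets above can be demonstrated in a few lines of plain Python (illustrative, independent of Ragas):

```python
def is_valid(score: float) -> bool:
    # NaN is the only float value not equal to itself, so this
    # returns True exactly for valid scores.
    return score == score

scores = [0.7, float("nan"), 0.9]

# Aggregate over successful evaluations only, mirroring the
# safe_nanmean behavior described above.
valid = [s for s in scores if is_valid(s)]
mean = sum(valid) / len(valid) if valid else float("nan")
```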
Reasoning
LLM API calls are inherently unreliable: they can fail due to rate limits, network issues, malformed responses, or provider outages. In a large evaluation run (hundreds or thousands of samples), strict failure would waste all completed work. The NaN pattern allows batch evaluations to complete even when a small percentage of calls fail, while preserving the information about which samples failed for debugging.
The `safe_nanmean()` function ensures that aggregate scores correctly ignore NaN values, so a few failures do not corrupt the overall evaluation metrics.
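The difference matters because a plain mean is poisoned by a single NaN, while a NaN-aware mean averages only the successful scores. A quick NumPy illustration:

```python
import numpy as np

scores = np.array([0.8, np.nan, 0.9, 0.7])

naive = np.mean(scores)      # NaN: one failure corrupts the aggregate
robust = np.nanmean(scores)  # averages only the three successful scores
```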
Code Evidence
Executor NaN fallback from `src/ragas/executor.py:67-84`:
```python
async def wrapped_callable_async(*args, **kwargs) -> t.Tuple[int, t.Any]:
    try:
        result = await callable(*args, **kwargs)
        return counter, result
    except Exception as e:
        if self.raise_exceptions:
            raise e
        else:
            exec_name = type(e).__name__
            exec_message = str(e)
            logger.error(
                "Exception raised in Job[%s]: %s(%s)",
                counter,
                exec_name,
                exec_message,
                exc_info=False,
            )
            return counter, np.nan
```
NVIDIA metrics NaN-self-equality check from `src/ragas/metrics/_nv_metrics.py:109-130`:
```python
for retry in range(self.retry):
    formatted_prompt = ...
    resp = await ...
    score = self.process_score(resp.generations[0][0].text)
    if score == score:  # NaN check (NaN != NaN)
        break
    else:
        logger.warning(f"Retry: {retry}")
```
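The retry-until-valid shape of that loop can be sketched as a standalone helper (hypothetical, not the NVIDIA metric code): keep calling until a non-NaN score arrives, and fall through to NaN when retries are exhausted.

```python
import logging

logger = logging.getLogger(__name__)

def score_with_retries(call, retries=3):
    """Retry-until-valid sketch: return the first non-NaN score,
    or NaN if every attempt fails (illustrative helper)."""
    score = float("nan")
    for attempt in range(retries):
        score = call()
        if score == score:  # valid (non-NaN) score
            break
        logger.warning("Retry: %s", attempt)
    return score

# Third attempt succeeds, so 0.85 is returned.
attempts = iter([float("nan"), float("nan"), 0.85])
result = score_with_retries(lambda: next(attempts))
```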
safe_nanmean from `src/ragas/utils.py:46-48`:
```python
def safe_nanmean(arr: t.List[float]) -> float:
    if len(arr) == 0:
        return np.nan
    # ...remainder (not shown) averages the non-NaN entries
```