
Principle: LlamaIndex Evaluation Result Analysis

From Leeroopedia

Overview

Evaluation Result Analysis is the final stage of the RAG evaluation pipeline, where raw evaluation outputs are transformed into actionable insights. After running evaluators against a set of queries, the results — captured as EvaluationResult objects — must be interpreted, aggregated, and analyzed to understand pipeline quality, identify failure modes, and guide improvements.

The quality of analysis determines whether evaluation is merely a checkbox exercise or a genuine driver of RAG pipeline improvement.

Categories: RAG Evaluation, Metrics Analysis, Quality Assurance, Evaluation Reporting

Understanding Pass Rates, Scores, and Feedback

Each EvaluationResult carries three complementary signals:

Pass/Fail Verdict (passing)

A boolean indicating whether the response met the evaluator's quality threshold. This is the most accessible metric:

  • Faithfulness pass rate — percentage of responses that are fully supported by retrieved context (no hallucination)
  • Relevancy pass rate — percentage of query-response pairs where the retrieval and generation are on-topic
  • Correctness pass rate — percentage of responses scoring above the configured threshold (e.g., 4.0 out of 5.0)

Pass rates provide an at-a-glance summary of pipeline quality and are the primary metric for comparing pipeline configurations.
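The pass-rate computation can be sketched as follows. A minimal stand-in dataclass is used in place of llama_index's EvaluationResult so the example runs standalone; only the `passing` field is modeled here.

```python
from dataclasses import dataclass
from typing import Optional

# Minimal stand-in for llama_index.core.evaluation.EvaluationResult;
# only the field used below is modeled.
@dataclass
class EvalResult:
    passing: Optional[bool] = None

def pass_rate(results) -> float:
    """Fraction of results whose `passing` verdict is True."""
    verdicts = [r.passing for r in results if r.passing is not None]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Example: three faithfulness results, two passing.
faithfulness = [EvalResult(passing=True), EvalResult(passing=True), EvalResult(passing=False)]
print(f"Faithfulness pass rate: {pass_rate(faithfulness):.0%}")  # 67%
```

The same function applies unchanged to relevancy and correctness results, since all three evaluators populate the `passing` field.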

Numeric Scores (score)

Some evaluators (particularly CorrectnessEvaluator) produce numeric scores that capture quality on a continuous scale rather than a binary verdict. Scores enable:

  • Ranking responses — identifying which queries produce the best and worst responses
  • Measuring fine-grained improvement — a rise in average score from 3.2 to 3.8 is meaningful even if pass rates stay the same
  • Setting variable thresholds — different use cases may require different minimum scores
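Ranking and averaging over scores can be sketched with plain (query, score) pairs; the queries and scores below are illustrative, standing in for the `query` and `score` fields of real EvaluationResult objects.

```python
# Illustrative correctness scores on a 1-5 scale, one per query.
scored = [
    ("What is the refund policy?", 4.5),
    ("Compare plans A and B", 2.0),
    ("When was the study published?", 3.5),
]

# Mean score: continuous quality level across the run.
mean_score = sum(s for _, s in scored) / len(scored)

# Ranking: sort ascending so the weakest responses surface first.
worst_first = sorted(scored, key=lambda pair: pair[1])

print(f"Mean correctness score: {mean_score:.2f}")  # 3.33
print("Lowest-scoring query:", worst_first[0][0])
```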

Textual Feedback (feedback)

The judge LLM provides natural-language explanations for its verdicts. This feedback is the most valuable signal for debugging because it explains why a response failed:

  • "The response claims the study was published in 2023, but the context only mentions 2022" (faithfulness failure)
  • "The retrieved context discusses manufacturing processes, which is not relevant to the user's question about pricing" (relevancy failure)
  • "The response is partially correct but misses key details about the implementation approach" (correctness issue)
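A common debugging loop is to filter for failures and read their feedback. The sketch below uses plain dicts as stand-ins for EvaluationResult objects; the field names mirror the `passing` and `feedback` attributes described above.

```python
# Illustrative evaluation results (stand-ins for EvaluationResult objects).
results = [
    {"query": "When was the study published?", "passing": False,
     "feedback": "The response claims 2023, but the context only mentions 2022."},
    {"query": "What is the pricing?", "passing": True,
     "feedback": "Fully supported by the retrieved context."},
]

# Keep only failures and surface the judge's explanation for each.
failures = [r for r in results if not r["passing"]]
for r in failures:
    print(f"FAILED: {r['query']}\n  reason: {r['feedback']}")
```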

Identifying Failure Modes

Systematic analysis of evaluation results reveals patterns in pipeline failures:

Retrieval Failures

When relevancy consistently fails but faithfulness passes, the problem is in retrieval:

  • The system faithfully generates from whatever context it retrieves
  • But the retrieved context is not relevant to the query
  • Fix: improve embedding model, chunking strategy, or retrieval parameters

Generation Failures

When faithfulness fails but relevancy passes, the problem is in generation:

  • The system retrieves relevant context
  • But the LLM generates claims not supported by that context
  • Fix: adjust prompt template, use a more instruction-following model, or add output constraints

Comprehension Failures

When correctness fails but both faithfulness and relevancy pass, the problem is in synthesis:

  • The right context is retrieved and the response is grounded
  • But the response fails to capture the essential information needed for a correct answer
  • Fix: improve the synthesis prompt, increase context window, or adjust chunk size

Systematic Failures

When multiple evaluators fail for the same queries, look for:

  • Document quality issues — source documents may be ambiguous, contradictory, or incomplete
  • Query complexity — multi-hop or comparative questions may exceed the pipeline's capabilities
  • Topic gaps — certain topics in the corpus may lack sufficient coverage for accurate responses
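The decision rules above can be collected into a small per-query classifier. This is a heuristic sketch: the function name and labels are illustrative, and real pipelines may need finer-grained rules.

```python
def classify_failure(faithful: bool, relevant: bool, correct: bool) -> str:
    """Map one query's verdicts to the failure modes described above."""
    if faithful and not relevant:
        return "retrieval failure"      # grounded, but in off-topic context
    if relevant and not faithful:
        return "generation failure"     # right context, unsupported claims
    if faithful and relevant and not correct:
        return "comprehension failure"  # grounded and on-topic, yet wrong
    if not faithful and not relevant:
        return "systematic failure"     # look at documents or query complexity
    return "ok"

print(classify_failure(faithful=True, relevant=False, correct=False))  # retrieval failure
```

Running this over every query in a run and counting the labels gives a quick profile of where the pipeline loses quality.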

Aggregating Metrics Across Evaluation Runs

Per-Metric Aggregation

For each evaluator, compute:

  • Pass rate — count of passing results divided by total results; summarizes overall quality
  • Mean score — average of numeric scores (where available); quality level on a continuous scale
  • Score distribution — histogram or percentile breakdown of scores; reveals quality variance
  • Failure rate by category — failures grouped by feedback patterns; surfaces systematic issues
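These aggregates can be computed from a list of numeric scores with the standard library alone. The threshold and scores below are illustrative; the 4.0-out-of-5.0 cutoff matches the correctness threshold mentioned earlier.

```python
import statistics

def aggregate(scores, threshold=4.0):
    """Per-metric aggregation: pass rate, mean score, and quartile breakdown."""
    passing = [s >= threshold for s in scores]
    return {
        "pass_rate": sum(passing) / len(scores),
        "mean_score": statistics.mean(scores),
        # Three cut points: 25th, 50th, and 75th percentiles.
        "p25_p50_p75": statistics.quantiles(scores, n=4),
    }

summary = aggregate([3.0, 4.0, 4.5, 5.0, 2.5])
print(summary)
```

Grouping failures by feedback pattern is harder to automate and usually starts with manual review of the feedback texts, as discussed under best practices below.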

Cross-Metric Correlation

Analyzing how metrics correlate reveals pipeline dynamics:

  • High faithfulness + low relevancy = retrieval problem
  • High relevancy + low faithfulness = generation problem
  • High both + low correctness = synthesis or complexity problem

Temporal Tracking

When evaluation runs are performed repeatedly (e.g., after each pipeline change), tracking metrics over time reveals:

  • Regression detection — a drop in pass rates after a change indicates a regression
  • Improvement validation — confirming that an intended improvement actually raised the metrics
  • Trend analysis — gradual degradation may indicate data drift or model degradation
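Regression detection across two runs can be sketched as a comparison of per-metric pass rates, with a small tolerance to absorb judge-LLM noise. The metric names, rates, and tolerance value are illustrative.

```python
def detect_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02):
    """Return metrics whose pass rate dropped by more than `tolerance`
    between a baseline run and a candidate run on the same dataset."""
    return {
        metric: (baseline[metric], candidate.get(metric, 0.0))
        for metric in baseline
        if candidate.get(metric, 0.0) < baseline[metric] - tolerance
    }

baseline = {"faithfulness": 0.92, "relevancy": 0.88, "correctness": 0.75}
candidate = {"faithfulness": 0.85, "relevancy": 0.89, "correctness": 0.74}

regressions = detect_regressions(baseline, candidate)
print(regressions)  # {'faithfulness': (0.92, 0.85)}
```

The comparison is only meaningful when both runs use the same versioned evaluation dataset, per the best practices below.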

Invalid Results

The EvaluationResult includes invalid_result and invalid_reason fields for cases where evaluation itself failed:

  • The judge LLM produced unparseable output
  • The evaluation call timed out
  • The input was malformed

Invalid results should be tracked separately from passing and failing results. A high invalid rate suggests problems with the evaluation configuration rather than the pipeline being evaluated.
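Separating invalid results before computing pass rates can be sketched as a simple partition. The dicts below are stand-ins whose field names mirror the `passing`, `invalid_result`, and `invalid_reason` attributes described above.

```python
# Illustrative results, one of which failed at evaluation time.
results = [
    {"passing": True,  "invalid_result": False, "invalid_reason": None},
    {"passing": None,  "invalid_result": True,  "invalid_reason": "judge output unparseable"},
    {"passing": False, "invalid_result": False, "invalid_reason": None},
]

# Partition: invalid results must not count as failures of the pipeline.
valid = [r for r in results if not r["invalid_result"]]
invalid_rate = 1 - len(valid) / len(results)
pass_rate = sum(r["passing"] for r in valid) / len(valid)

print(f"invalid rate: {invalid_rate:.0%}, pass rate over valid results: {pass_rate:.0%}")
```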

Best Practices for Result Analysis

  • Always examine failures qualitatively — reading feedback text for failed evaluations provides more insight than aggregate metrics alone
  • Stratify by query type — different question categories (factual, comparative, analytical) may have different quality profiles
  • Use multiple metrics together — no single metric captures all quality dimensions
  • Track invalid rates — high invalid rates indicate evaluation configuration problems
  • Version your evaluation datasets — changing the evaluation set makes cross-run comparisons invalid
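Stratifying by query type can be sketched by grouping verdicts under a category label. The categories and verdicts below are illustrative; in practice the label would be assigned when the evaluation dataset is built.

```python
from collections import defaultdict

# Illustrative (category, passing) records, one per evaluated query.
records = [
    ("factual", True), ("factual", True), ("factual", False),
    ("comparative", False), ("comparative", False),
    ("analytical", True),
]

# Group verdicts by query category.
by_type = defaultdict(list)
for category, passed in records:
    by_type[category].append(passed)

# Per-category pass rates expose uneven quality profiles.
for category, verdicts in by_type.items():
    print(f"{category}: {sum(verdicts) / len(verdicts):.0%} pass rate")
```

A run like this often shows, for example, strong factual performance alongside weak comparative performance, which an overall pass rate would hide.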

Knowledge Sources

  • LlamaIndex Evaluation
  • LlamaIndex EvaluationResult

2026-02-11 00:00 GMT
