Principle: LlamaIndex Evaluation Result Analysis
Overview
Evaluation Result Analysis is the final stage of the RAG evaluation pipeline, where raw evaluation outputs are transformed into actionable insights. After running evaluators against a set of queries, the results — captured as EvaluationResult objects — must be interpreted, aggregated, and analyzed to understand pipeline quality, identify failure modes, and guide improvements.
The quality of analysis determines whether evaluation is merely a checkbox exercise or a genuine driver of RAG pipeline improvement.
Understanding Pass Rates, Scores, and Feedback
Each EvaluationResult carries three complementary signals:
Pass/Fail Verdict (passing)
A boolean indicating whether the response met the evaluator's quality threshold. This is the most accessible metric:
- Faithfulness pass rate — percentage of responses that are fully supported by retrieved context (no hallucination)
- Relevancy pass rate — percentage of query-response pairs where the retrieval and generation are on-topic
- Correctness pass rate — percentage of responses scoring above the configured threshold (e.g., 4.0 out of 5.0)
Pass rates provide an at-a-glance summary of pipeline quality and are the primary metric for comparing pipeline configurations.
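As a minimal sketch, a pass rate can be computed directly from a list of results. The dicts below merely mirror the `passing` and `invalid_result` fields of `EvaluationResult`, and the sample data is invented for illustration:

```python
# Hypothetical results, each dict mirroring the EvaluationResult
# fields used here (`passing`, `invalid_result`)
results = [
    {"passing": True, "invalid_result": False},
    {"passing": False, "invalid_result": False},
    {"passing": True, "invalid_result": False},
]

def pass_rate(results):
    """Fraction of valid results that met the evaluator's threshold."""
    valid = [r for r in results if not r["invalid_result"]]
    if not valid:
        return 0.0
    return sum(r["passing"] for r in valid) / len(valid)

print(f"Pass rate: {pass_rate(results):.0%}")  # Pass rate: 67%
```

Filtering out invalid results first keeps the pass rate from being silently diluted by evaluation failures.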
Numeric Scores (score)
Some evaluators (particularly CorrectnessEvaluator) produce numeric scores that capture quality on a continuous scale rather than a binary verdict. Scores enable:
- Ranking responses — identifying which queries produce the best and worst responses
- Measuring improvement granularity — a rise in the average score from 3.2 to 3.8 is meaningful even if the pass rate stays the same
- Setting variable thresholds — different use cases may require different minimum scores
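These uses can be sketched over a list of (query, score) pairs; the queries, scores, and both thresholds below are invented for illustration:

```python
# Hypothetical correctness scores, one (query, score) pair per evaluated query
scored = [
    ("What is the refund policy?", 4.5),
    ("Compare plans A and B on price", 2.5),
    ("When was the study published?", 3.0),
]

# Ranking: surface the worst-scoring queries for manual inspection
worst = sorted(scored, key=lambda pair: pair[1])[:2]

# Variable thresholds: the same scores judged against different minimums
strict_pass = sum(s >= 4.0 for _, s in scored) / len(scored)
lenient_pass = sum(s >= 3.0 for _, s in scored) / len(scored)
print(worst, strict_pass, lenient_pass)
```

Note how the strict and lenient thresholds produce different pass rates from identical scores, which is why the threshold belongs to the use case, not the evaluator.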
Textual Feedback (feedback)
The judge LLM provides natural-language explanations for its verdicts. This feedback is the most valuable signal for debugging because it explains why a response failed:
- "The response claims the study was published in 2023, but the context only mentions 2022" (faithfulness failure)
- "The retrieved context discusses manufacturing processes, which is not relevant to the user's question about pricing" (relevancy failure)
- "The response is partially correct but misses key details about the implementation approach" (correctness issue)
Identifying Failure Modes
Systematic analysis of evaluation results reveals patterns in pipeline failures:
Retrieval Failures
When relevancy consistently fails but faithfulness passes, the problem is in retrieval:
- The system faithfully generates from whatever context it retrieves
- But the retrieved context is not relevant to the query
- Fix: improve embedding model, chunking strategy, or retrieval parameters
Generation Failures
When faithfulness fails but relevancy passes, the problem is in generation:
- The system retrieves relevant context
- But the LLM generates claims not supported by that context
- Fix: adjust prompt template, use a more instruction-following model, or add output constraints
Comprehension Failures
When correctness fails but both faithfulness and relevancy pass, the problem is in synthesis:
- The right context is retrieved and the response is grounded
- But the response fails to capture the essential information needed for a correct answer
- Fix: improve the synthesis prompt, increase context window, or adjust chunk size
Systematic Failures
When multiple evaluators fail for the same queries, look for:
- Document quality issues — source documents may be ambiguous, contradictory, or incomplete
- Query complexity — multi-hop or comparative questions may exceed the pipeline's capabilities
- Topic gaps — certain topics in the corpus may lack sufficient coverage for accurate responses
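The four patterns above can be folded into one heuristic classifier. This is a sketch of the mapping, not a LlamaIndex API; the per-query boolean verdicts are assumed to come from the faithfulness, relevancy, and correctness evaluators:

```python
def diagnose(faithful: bool, relevant: bool, correct: bool) -> str:
    """Map one query's three verdicts to a likely failure locus (heuristic)."""
    if faithful and not relevant:
        return "retrieval"   # grounded generation over off-topic context
    if relevant and not faithful:
        return "generation"  # relevant context, unsupported claims
    if faithful and relevant and not correct:
        return "synthesis"   # grounded and on-topic, yet misses the answer
    if not faithful and not relevant:
        return "systematic"  # multiple evaluators fail together
    return "ok"

print(diagnose(faithful=True, relevant=False, correct=False))  # retrieval
```

Running this over every query and tallying the labels gives a quick picture of where the pipeline loses the most quality.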
Aggregating Metrics Across Evaluation Runs
Per-Metric Aggregation
For each evaluator, compute:
| Metric | Computation | Purpose |
|---|---|---|
| Pass rate | Count of passing results / total results | Overall quality summary |
| Mean score | Average of numeric scores (where available) | Quality level on continuous scale |
| Score distribution | Histogram or percentile breakdown of scores | Understanding quality variance |
| Failure rate by category | Group failures by feedback patterns | Identifying systematic issues |
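The first three rows of the table can be sketched with only the standard library; the scores and the 4.0 passing threshold are illustrative:

```python
from statistics import mean, quantiles

# Hypothetical correctness scores on a 1-5 scale, passing threshold 4.0
scores = [4.5, 3.0, 2.5, 5.0, 4.0, 3.5]

summary = {
    "pass_rate": sum(s >= 4.0 for s in scores) / len(scores),
    "mean_score": mean(scores),
    # Quartiles as a coarse stand-in for a full score distribution
    "quartiles": quantiles(scores, n=4),
}
print(summary)
```

The quartile spread is often more telling than the mean: a tight spread around 3.5 and a bimodal mix of 2s and 5s have the same average but very different failure profiles.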
Cross-Metric Correlation
Analyzing how metrics correlate reveals pipeline dynamics:
- High faithfulness + low relevancy = retrieval problem
- High relevancy + low faithfulness = generation problem
- High faithfulness and high relevancy + low correctness = synthesis or complexity problem
Temporal Tracking
When evaluation runs are performed repeatedly (e.g., after each pipeline change), tracking metrics over time reveals:
- Regression detection — a drop in pass rates after a change indicates a regression
- Improvement validation — confirming that an intended improvement actually improved metrics
- Trend analysis — gradual degradation may indicate data drift or model degradation
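Regression detection across runs can be sketched as a comparison of consecutive pass rates; the run labels and the 5-point tolerance are assumptions:

```python
# Pass rates from successive evaluation runs over the same versioned dataset
history = {"run-1": 0.82, "run-2": 0.84, "run-3": 0.71}

def detect_regressions(history, tolerance=0.05):
    """Flag runs whose pass rate dropped more than `tolerance` below the prior run."""
    runs = list(history.items())
    return [
        label
        for (_, prev), (label, cur) in zip(runs, runs[1:])
        if prev - cur > tolerance
    ]

print(detect_regressions(history))  # ['run-3']
```

The tolerance absorbs normal judge-LLM variance between runs so that only genuine drops are flagged.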
Invalid Results
The EvaluationResult includes invalid_result and invalid_reason fields for cases where evaluation itself failed:
- The judge LLM produced unparseable output
- The evaluation call timed out
- The input was malformed
Invalid results should be tracked separately from passing and failing results. A high invalid rate suggests problems with the evaluation configuration rather than the pipeline being evaluated.
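A sketch of that separation, again with made-up results mirroring the `passing`, `invalid_result`, and `invalid_reason` fields:

```python
results = [
    {"passing": True,  "invalid_result": False, "invalid_reason": None},
    {"passing": False, "invalid_result": False, "invalid_reason": None},
    {"passing": None,  "invalid_result": True,
     "invalid_reason": "Judge LLM produced unparseable output"},
]

invalid = [r for r in results if r["invalid_result"]]
valid = [r for r in results if not r["invalid_result"]]

invalid_rate = len(invalid) / len(results)  # evaluation-config health
valid_pass_rate = sum(r["passing"] for r in valid) / len(valid)  # pipeline quality
print(invalid_rate, valid_pass_rate)
```

Reporting the two rates separately keeps an evaluation-harness problem from masquerading as a pipeline regression.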
Best Practices for Result Analysis
- Always examine failures qualitatively — reading feedback text for failed evaluations provides more insight than aggregate metrics alone
- Stratify by query type — different question categories (factual, comparative, analytical) may have different quality profiles
- Use multiple metrics together — no single metric captures all quality dimensions
- Track invalid rates — high invalid rates indicate evaluation configuration problems
- Version your evaluation datasets — changing the evaluation set makes cross-run comparisons invalid
Knowledge Sources
- LlamaIndex Evaluation
- LlamaIndex EvaluationResult