Principle: LlamaIndex Evaluation Result Analysis
Overview
Evaluation Result Analysis is the final stage of the RAG evaluation pipeline, where raw evaluation outputs are transformed into actionable insights. After running evaluators against a set of queries, the results — captured as EvaluationResult objects — must be interpreted, aggregated, and analyzed to understand pipeline quality, identify failure modes, and guide improvements.
The quality of analysis determines whether evaluation is merely a checkbox exercise or a genuine driver of RAG pipeline improvement.
Understanding Pass Rates, Scores, and Feedback
Each EvaluationResult carries three complementary signals:
Pass/Fail Verdict (passing)
A boolean indicating whether the response met the evaluator's quality threshold. This is the most accessible metric:
- Faithfulness pass rate — percentage of responses that are fully supported by retrieved context (no hallucination)
- Relevancy pass rate — percentage of query-response pairs where the retrieval and generation are on-topic
- Correctness pass rate — percentage of responses scoring above the configured threshold (e.g., 4.0 out of 5.0)
Pass rates provide an at-a-glance summary of pipeline quality and are the primary metric for comparing pipeline configurations.
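As a minimal sketch, a pass rate can be computed directly from a list of results. The dicts below merely mirror the `passing` and `invalid_result` fields of `EvaluationResult`, and the sample data is invented for illustration:

```python
# Hypothetical results, each dict mirroring the EvaluationResult
# fields used here (`passing`, `invalid_result`)
results = [
    {"passing": True, "invalid_result": False},
    {"passing": False, "invalid_result": False},
    {"passing": True, "invalid_result": False},
]

def pass_rate(results):
    """Fraction of valid results that met the evaluator's threshold."""
    valid = [r for r in results if not r["invalid_result"]]
    if not valid:
        return 0.0
    return sum(r["passing"] for r in valid) / len(valid)

print(f"Pass rate: {pass_rate(results):.0%}")  # Pass rate: 67%
```

Filtering out invalid results first keeps the pass rate from being silently diluted by evaluation failures.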
Numeric Scores (score)
Some evaluators (particularly CorrectnessEvaluator) produce numeric scores that capture quality on a continuous scale rather than a binary verdict. Scores enable:
- Ranking responses — identifying which queries produce the best and worst responses
- Measuring improvement granularity — a rise in the average score from 3.2 to 3.8 is meaningful even if the pass rate stays the same
- Setting variable thresholds — different use cases may require different minimum scores
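These uses can be sketched over a list of (query, score) pairs; the queries, scores, and both thresholds below are invented for illustration:

```python
# Hypothetical correctness scores, one (query, score) pair per evaluated query
scored = [
    ("What is the refund policy?", 4.5),
    ("Compare plans A and B on price", 2.5),
    ("When was the study published?", 3.0),
]

# Ranking: surface the worst-scoring queries for manual inspection
worst = sorted(scored, key=lambda pair: pair[1])[:2]

# Variable thresholds: the same scores judged against different minimums
strict_pass = sum(s >= 4.0 for _, s in scored) / len(scored)
lenient_pass = sum(s >= 3.0 for _, s in scored) / len(scored)
print(worst, strict_pass, lenient_pass)
```

Note how the strict and lenient thresholds produce different pass rates from identical scores, which is why the threshold belongs to the use case, not the evaluator.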
Textual Feedback (feedback)
The judge LLM provides natural-language explanations for its verdicts. This feedback is the most valuable signal for debugging because it explains why a response failed:
- "The response claims the study was published in 2023, but the context only mentions 2022" (faithfulness failure)
- "The retrieved context discusses manufacturing processes, which is not relevant to the user's question about pricing" (relevancy failure)
- "The response is partially correct but misses key details about the implementation approach" (correctness issue)
Identifying Failure Modes
Systematic analysis of evaluation results reveals patterns in pipeline failures:
Retrieval Failures
When relevancy consistently fails but faithfulness passes, the problem is in retrieval:
- The system faithfully generates from whatever context it retrieves
- But the retrieved context is not relevant to the query
- Fix: improve embedding model, chunking strategy, or retrieval parameters
Generation Failures
When faithfulness fails but relevancy passes, the problem is in generation:
- The system retrieves relevant context
- But the LLM generates claims not supported by that context
- Fix: adjust prompt template, use a more instruction-following model, or add output constraints
Comprehension Failures
When correctness fails but both faithfulness and relevancy pass, the problem is in synthesis:
- The right context is retrieved and the response is grounded
- But the response fails to capture the essential information needed for a correct answer
- Fix: improve the synthesis prompt, increase context window, or adjust chunk size
Systematic Failures
When multiple evaluators fail for the same queries, look for:
- Document quality issues — source documents may be ambiguous, contradictory, or incomplete
- Query complexity — multi-hop or comparative questions may exceed the pipeline's capabilities
- Topic gaps — certain topics in the corpus may lack sufficient coverage for accurate responses
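The four patterns above can be folded into one heuristic classifier. This is a sketch of the mapping, not a LlamaIndex API; the per-query boolean verdicts are assumed to come from the faithfulness, relevancy, and correctness evaluators:

```python
def diagnose(faithful: bool, relevant: bool, correct: bool) -> str:
    """Map one query's three verdicts to a likely failure locus (heuristic)."""
    if faithful and not relevant:
        return "retrieval"   # grounded generation over off-topic context
    if relevant and not faithful:
        return "generation"  # relevant context, unsupported claims
    if faithful and relevant and not correct:
        return "synthesis"   # grounded and on-topic, yet misses the answer
    if not faithful and not relevant:
        return "systematic"  # multiple evaluators fail together
    return "ok"

print(diagnose(faithful=True, relevant=False, correct=False))  # retrieval
```

Running this over every query and tallying the labels gives a quick picture of where the pipeline loses the most quality.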
Aggregating Metrics Across Evaluation Runs
Per-Metric Aggregation
For each evaluator, compute:
| Metric | Computation | Purpose |
|---|---|---|
| Pass rate | Count of passing results / total results | Overall quality summary |
| Mean score | Average of numeric scores (where available) | Quality level on continuous scale |
| Score distribution | Histogram or percentile breakdown of scores | Understanding quality variance |
| Failure rate by category | Group failures by feedback patterns | Identifying systematic issues |
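The first three rows of the table can be sketched with only the standard library; the scores and the 4.0 passing threshold are illustrative:

```python
from statistics import mean, quantiles

# Hypothetical correctness scores on a 1-5 scale, passing threshold 4.0
scores = [4.5, 3.0, 2.5, 5.0, 4.0, 3.5]

summary = {
    "pass_rate": sum(s >= 4.0 for s in scores) / len(scores),
    "mean_score": mean(scores),
    # Quartiles as a coarse stand-in for a full score distribution
    "quartiles": quantiles(scores, n=4),
}
print(summary)
```

The quartile spread is often more telling than the mean: a tight spread around 3.5 and a bimodal mix of 2s and 5s have the same average but very different failure profiles.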
Cross-Metric Correlation
Analyzing how metrics correlate reveals pipeline dynamics:
- High faithfulness + low relevancy = retrieval problem
- High relevancy + low faithfulness = generation problem
- High faithfulness and high relevancy + low correctness = synthesis or complexity problem
Temporal Tracking
When evaluation runs are performed repeatedly (e.g., after each pipeline change), tracking metrics over time reveals:
- Regression detection — a drop in pass rates after a change indicates a regression
- Improvement validation — confirming that an intended improvement actually improved metrics
- Trend analysis — gradual degradation may indicate data drift or model degradation
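Regression detection across runs can be sketched as a comparison of consecutive pass rates; the run labels and the 5-point tolerance are assumptions:

```python
# Pass rates from successive evaluation runs over the same versioned dataset
history = {"run-1": 0.82, "run-2": 0.84, "run-3": 0.71}

def detect_regressions(history, tolerance=0.05):
    """Flag runs whose pass rate dropped more than `tolerance` below the prior run."""
    runs = list(history.items())
    return [
        label
        for (_, prev), (label, cur) in zip(runs, runs[1:])
        if prev - cur > tolerance
    ]

print(detect_regressions(history))  # ['run-3']
```

The tolerance absorbs normal judge-LLM variance between runs so that only genuine drops are flagged.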
Invalid Results
The EvaluationResult includes invalid_result and invalid_reason fields for cases where evaluation itself failed:
- The judge LLM produced unparseable output
- The evaluation call timed out
- The input was malformed
Invalid results should be tracked separately from passing and failing results. A high invalid rate suggests problems with the evaluation configuration rather than the pipeline being evaluated.
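A sketch of that separation, again with made-up results mirroring the `passing`, `invalid_result`, and `invalid_reason` fields:

```python
results = [
    {"passing": True,  "invalid_result": False, "invalid_reason": None},
    {"passing": False, "invalid_result": False, "invalid_reason": None},
    {"passing": None,  "invalid_result": True,
     "invalid_reason": "Judge LLM produced unparseable output"},
]

invalid = [r for r in results if r["invalid_result"]]
valid = [r for r in results if not r["invalid_result"]]

invalid_rate = len(invalid) / len(results)  # evaluation-config health
valid_pass_rate = sum(r["passing"] for r in valid) / len(valid)  # pipeline quality
print(invalid_rate, valid_pass_rate)
```

Reporting the two rates separately keeps an evaluation-harness problem from masquerading as a pipeline regression.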
Best Practices for Result Analysis
- Always examine failures qualitatively — reading feedback text for failed evaluations provides more insight than aggregate metrics alone
- Stratify by query type — different question categories (factual, comparative, analytical) may have different quality profiles
- Use multiple metrics together — no single metric captures all quality dimensions
- Track invalid rates — high invalid rates indicate evaluation configuration problems
- Version your evaluation datasets — changing the evaluation set makes cross-run comparisons invalid
Knowledge Sources
- LlamaIndex Evaluation
- LlamaIndex EvaluationResult