Workflow: Run LlamaIndex Evaluation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Evaluation, RAG |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
End-to-end process for systematically evaluating RAG pipeline outputs using multiple LLM-based metrics run in parallel via the BatchEvalRunner.
Description
This workflow provides a systematic approach to measuring the quality of LlamaIndex RAG pipelines. It uses LLM-based evaluators to assess faithfulness (is the answer grounded in context?), relevancy (does the answer address the query?), correctness (does it match a reference answer?), and other metrics. The BatchEvalRunner orchestrates parallel evaluation across multiple metrics and queries, with retry logic and configurable concurrency. Results can be aggregated for comparison between different pipeline configurations.
Usage
Execute this workflow when you need to quantitatively assess the quality of a RAG pipeline, compare different configurations (models, chunk sizes, retrieval strategies), or validate that changes have not degraded performance. This is essential before deploying a pipeline to production or after fine-tuning models.
Execution Steps
Step 1: Generate Evaluation Dataset
Create a set of test queries with optional reference answers and expected contexts. This can be done manually for high-quality evaluation, or automatically with DatasetGenerator or RagDatasetGenerator, both of which use an LLM to generate questions from the document corpus.
Key considerations:
- Manual datasets provide higher quality but require domain expertise
- LLM-generated datasets scale better but may contain noise
- Include diverse query types (factual, analytical, comparative)
- Reference answers enable correctness evaluation
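The dataset shape the considerations above describe can be sketched as plain Python. This is a hand-built example, not library code; the queries and answers are illustrative, and the `RagDatasetGenerator` call mentioned in the comment is the LlamaIndex helper referenced above (its exact import path varies by library version).

```python
# Hand-built evaluation dataset: each example pairs a query with an
# optional reference answer (references enable correctness evaluation).
# For automatic generation, LlamaIndex's RagDatasetGenerator can produce
# questions from a document corpus instead of writing them by hand.
eval_dataset = [
    {"query": "What year was the company founded?",           # factual
     "reference": "The company was founded in 2015."},
    {"query": "Why did revenue decline in Q3?",               # analytical
     "reference": None},  # no ground truth -> correctness is skipped
    {"query": "How does plan A compare to plan B on price?",  # comparative
     "reference": "Plan A costs $10/month; plan B costs $25/month."},
]

queries = [ex["query"] for ex in eval_dataset]
with_reference = [ex for ex in eval_dataset if ex["reference"] is not None]
```

Keeping queries and references in one structure makes it easy to route only the referenced subset to correctness evaluation while every query still flows through faithfulness and relevancy.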
Step 2: Configure Evaluators
Instantiate the evaluator classes for each metric to assess. Each evaluator takes an LLM instance that performs the evaluation judgment. Available evaluators include FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator, ContextRelevancyEvaluator, AnswerRelevancyEvaluator, SemanticSimilarityEvaluator, and GuidelineEvaluator.
Key considerations:
- FaithfulnessEvaluator checks if the response is supported by the retrieved context
- RelevancyEvaluator checks whether the response and retrieved context are relevant to the query
- CorrectnessEvaluator compares against a reference answer (requires ground truth)
- SemanticSimilarityEvaluator uses embeddings rather than LLM judgment
- Use a strong LLM (e.g., GPT-4) as the evaluation judge for better accuracy
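To make the evaluator interface concrete, here is a library-free toy: a stand-in "faithfulness" check that approximates groundedness via token overlap instead of an LLM judge. The `EvaluationResult` fields and the `evaluators` dict-of-named-evaluators pattern mirror the shape used above; the real `FaithfulnessEvaluator(llm=...)` wraps an LLM judgment rather than this heuristic.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    score: float    # typically 0.0-1.0
    passing: bool
    feedback: str   # textual explanation of the judgment

class ToyFaithfulnessEvaluator:
    """Toy stand-in for an LLM-judged faithfulness evaluator: 'grounded'
    is approximated as the fraction of response words found in context."""
    def evaluate(self, query, response, contexts):
        ctx_words = set(" ".join(contexts).lower().split())
        resp_words = response.lower().split()
        if not resp_words:
            return EvaluationResult(0.0, False, "empty response")
        overlap = sum(w in ctx_words for w in resp_words) / len(resp_words)
        return EvaluationResult(overlap, overlap >= 0.8,
                                f"{overlap:.0%} of response tokens found in context")

# Evaluators are keyed by name; the same dict is later handed to the runner.
evaluators = {"faithfulness": ToyFaithfulnessEvaluator()}
result = evaluators["faithfulness"].evaluate(
    "What is the capital?", "paris is the capital",
    ["paris is the capital of france"])
```

The key point is the shared contract: every evaluator, whatever its internals, returns a score, a passing flag, and feedback, which is what lets a batch runner treat them uniformly.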
Step 3: Create Batch Runner
Instantiate BatchEvalRunner with the dictionary of evaluators and configure concurrency settings. The runner manages parallel execution of evaluations using asyncio with a semaphore to limit concurrent API calls.
Key considerations:
- The workers parameter controls maximum concurrent evaluations (default: 2)
- Higher worker counts speed up evaluation but increase API rate pressure
- show_progress=True enables progress bars for monitoring
- Retry logic handles transient API failures (3 attempts, exponential backoff)
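The concurrency and retry mechanics described above (semaphore-limited asyncio workers, 3 attempts with exponential backoff) can be sketched with the standard library alone. This is not BatchEvalRunner's implementation, only a minimal illustration of the same pattern; the `workers` parameter plays the role described above.

```python
import asyncio

async def _with_retry(coro_fn, attempts=3, base_delay=0.01):
    """Retry a coroutine factory with exponential backoff, mirroring the
    runner's handling of transient API failures (3 attempts)."""
    for i in range(attempts):
        try:
            return await coro_fn()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** i))

async def run_batch(jobs, workers=2):
    """Run evaluation jobs concurrently, at most `workers` at a time
    (the role of the runner's `workers` parameter)."""
    sem = asyncio.Semaphore(workers)
    async def guarded(job):
        async with sem:
            return await _with_retry(job)
    return await asyncio.gather(*(guarded(j) for j in jobs))

# Demo: one flaky job that fails twice before succeeding, plus two
# trivial jobs; results come back in submission order.
calls = {"n": 0}
def make_flaky():
    async def job():
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("transient")
        return "ok"
    return job

jobs = [make_flaky()] + [lambda v=v: asyncio.sleep(0, result=v) for v in (1, 2)]
results = asyncio.run(run_batch(jobs, workers=2))
```

Raising `workers` widens the semaphore and speeds things up at the cost of more concurrent API calls in flight, which is exactly the rate-pressure trade-off noted above.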
Step 4: Run Evaluation
Execute evaluations using one of three methods: evaluate_response_strs() for pre-computed string responses, evaluate_responses() for Response objects with source nodes, or evaluate_queries() for end-to-end evaluation that runs queries through a query engine first.
Key considerations:
- evaluate_queries() is the most convenient for end-to-end testing
- evaluate_responses() preserves source node information for context-based metrics
- Per-evaluator kwargs can be passed for evaluators that need additional inputs
- All methods return Dict[str, List[EvaluationResult]]
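The fan-out and return shape of the batch methods can be sketched as follows. This toy analogue of `evaluate_response_strs()` assumes nothing about the library beyond the `Dict[str, List[EvaluationResult]]` return shape stated above; the `ExactMatchEvaluator` is a hypothetical stand-in judge.

```python
def evaluate_response_strs(evaluators, queries, responses, contexts_list):
    """Toy analogue of the batch method: apply every evaluator to every
    (query, response, contexts) triple. Returns results keyed by
    evaluator name, with one entry per query."""
    results = {name: [] for name in evaluators}
    for query, response, contexts in zip(queries, responses, contexts_list):
        for name, ev in evaluators.items():
            results[name].append(ev.evaluate(query, response, contexts))
    return results

class ExactMatchEvaluator:
    """Hypothetical stand-in judge: passes when the response equals the
    first retrieved context string."""
    def evaluate(self, query, response, contexts):
        ok = bool(contexts) and response == contexts[0]
        return {"score": 1.0 if ok else 0.0, "passing": ok,
                "feedback": "exact match" if ok else "mismatch"}

results = evaluate_response_strs(
    {"exact": ExactMatchEvaluator()},
    queries=["q1", "q2"],
    responses=["a", "b"],
    contexts_list=[["a"], ["c"]],
)
```

Each evaluator name maps to a list aligned with the query order, which is what makes per-query drill-down and cross-evaluator aggregation straightforward in the next step.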
Step 5: Analyze Results
Process the evaluation results to compute aggregate metrics and identify failure patterns. Each EvaluationResult contains a score, a passing boolean, and textual feedback explaining the judgment. Aggregate across queries to get overall pipeline quality scores.
Key considerations:
- Results are keyed by evaluator name with a list of results per query
- Scores are typically 0.0 to 1.0 or binary (pass/fail); CorrectnessEvaluator scores on a 1-5 scale
- Feedback text explains why a particular judgment was made
- Compare metrics across configurations to guide optimization decisions
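A small aggregation sketch over the `Dict[str, List[EvaluationResult]]` shape: compute pass rate and mean score per evaluator so configurations can be compared side by side. Results are represented as plain dicts here, and the numbers are made up for illustration.

```python
def summarize(results):
    """Aggregate per-evaluator results into pass rate and mean score,
    the headline numbers for comparing pipeline configurations."""
    summary = {}
    for name, items in results.items():
        n = len(items)
        summary[name] = {
            "pass_rate": sum(r["passing"] for r in items) / n,
            "mean_score": sum(r["score"] for r in items) / n,
        }
    return summary

# Example: faithfulness mostly passes while correctness lags, suggesting
# answers are grounded in context but drift from the reference answers.
results = {
    "faithfulness": [{"score": 1.0, "passing": True},
                     {"score": 1.0, "passing": True},
                     {"score": 0.5, "passing": False},
                     {"score": 1.0, "passing": True}],
    "correctness":  [{"score": 4.0, "passing": True},   # 1-5 scale
                     {"score": 2.0, "passing": False},
                     {"score": 3.0, "passing": False},
                     {"score": 4.5, "passing": True}],
}
summary = summarize(results)
```

Reading the two metrics together is what guides optimization: high faithfulness with low correctness points at generation or prompting, while low faithfulness points back at retrieval.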