Workflow: Run LlamaIndex Evaluation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Evaluation, RAG |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
End-to-end process for systematically evaluating RAG pipeline outputs using multiple LLM-based metrics run in parallel via the BatchEvalRunner.
Description
This workflow provides a systematic approach to measuring the quality of LlamaIndex RAG pipelines. It uses LLM-based evaluators to assess faithfulness (is the answer grounded in context?), relevancy (does the answer address the query?), correctness (does it match a reference answer?), and other metrics. The BatchEvalRunner orchestrates parallel evaluation across multiple metrics and queries, with retry logic and configurable concurrency. Results can be aggregated for comparison between different pipeline configurations.
Usage
Execute this workflow when you need to quantitatively assess the quality of a RAG pipeline, compare different configurations (models, chunk sizes, retrieval strategies), or validate that changes have not degraded performance. This is essential before deploying a pipeline to production or after fine-tuning models.
Execution Steps
Step 1: Generate Evaluation Dataset
Create a set of test queries with optional reference answers and expected contexts. This can be done manually for high-quality evaluation, or automatically with DatasetGenerator or RagDatasetGenerator, both of which use an LLM to generate questions from the document corpus.
Key considerations:
- Manual datasets provide higher quality but require domain expertise
- LLM-generated datasets scale better but may contain noise
- Include diverse query types (factual, analytical, comparative)
- Reference answers enable correctness evaluation
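The dataset shape the considerations above describe can be sketched as plain Python. This is a hand-built example, not library code; the queries and answers are illustrative, and the `RagDatasetGenerator` call mentioned in the comment is the LlamaIndex helper referenced above (its exact import path varies by library version).

```python
# Hand-built evaluation dataset: each example pairs a query with an
# optional reference answer (references enable correctness evaluation).
# For automatic generation, LlamaIndex's RagDatasetGenerator can produce
# questions from a document corpus instead of writing them by hand.
eval_dataset = [
    {"query": "What year was the company founded?",           # factual
     "reference": "The company was founded in 2015."},
    {"query": "Why did revenue decline in Q3?",               # analytical
     "reference": None},  # no ground truth -> correctness is skipped
    {"query": "How does plan A compare to plan B on price?",  # comparative
     "reference": "Plan A costs $10/month; plan B costs $25/month."},
]

queries = [ex["query"] for ex in eval_dataset]
with_reference = [ex for ex in eval_dataset if ex["reference"] is not None]
```

Keeping queries and references in one structure makes it easy to route only the referenced subset to correctness evaluation while every query still flows through faithfulness and relevancy.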
Step 2: Configure Evaluators
Instantiate the evaluator classes for each metric to assess. Each evaluator takes an LLM instance that performs the evaluation judgment. Available evaluators include FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator, ContextRelevancyEvaluator, AnswerRelevancyEvaluator, SemanticSimilarityEvaluator, and GuidelineEvaluator.
Key considerations:
- FaithfulnessEvaluator checks if the response is supported by the retrieved context
- RelevancyEvaluator checks whether the response and retrieved context are relevant to the query
- CorrectnessEvaluator compares against a reference answer (requires ground truth)
- SemanticSimilarityEvaluator uses embeddings rather than LLM judgment
- Use a strong LLM (e.g., GPT-4) as the evaluation judge for better accuracy
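To make the evaluator interface concrete, here is a library-free toy: a stand-in "faithfulness" check that approximates groundedness via token overlap instead of an LLM judge. The `EvaluationResult` fields and the `evaluators` dict-of-named-evaluators pattern mirror the shape used above; the real `FaithfulnessEvaluator(llm=...)` wraps an LLM judgment rather than this heuristic.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    score: float    # typically 0.0-1.0
    passing: bool
    feedback: str   # textual explanation of the judgment

class ToyFaithfulnessEvaluator:
    """Toy stand-in for an LLM-judged faithfulness evaluator: 'grounded'
    is approximated as the fraction of response words found in context."""
    def evaluate(self, query, response, contexts):
        ctx_words = set(" ".join(contexts).lower().split())
        resp_words = response.lower().split()
        if not resp_words:
            return EvaluationResult(0.0, False, "empty response")
        overlap = sum(w in ctx_words for w in resp_words) / len(resp_words)
        return EvaluationResult(overlap, overlap >= 0.8,
                                f"{overlap:.0%} of response tokens found in context")

# Evaluators are keyed by name; the same dict is later handed to the runner.
evaluators = {"faithfulness": ToyFaithfulnessEvaluator()}
result = evaluators["faithfulness"].evaluate(
    "What is the capital?", "paris is the capital",
    ["paris is the capital of france"])
```

The key point is the shared contract: every evaluator, whatever its internals, returns a score, a passing flag, and feedback, which is what lets a batch runner treat them uniformly.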
Step 3: Create Batch Runner
Instantiate BatchEvalRunner with the dictionary of evaluators and configure concurrency settings. The runner manages parallel execution of evaluations using asyncio with a semaphore to limit concurrent API calls.
Key considerations:
- The workers parameter controls maximum concurrent evaluations (default: 2)
- Higher worker counts speed up evaluation but increase API rate pressure
- show_progress=True enables progress bars for monitoring
- Retry logic handles transient API failures (3 attempts, exponential backoff)
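The concurrency and retry mechanics described above (semaphore-limited asyncio workers, 3 attempts with exponential backoff) can be sketched with the standard library alone. This is not BatchEvalRunner's implementation, only a minimal illustration of the same pattern; the `workers` parameter plays the role described above.

```python
import asyncio

async def _with_retry(coro_fn, attempts=3, base_delay=0.01):
    """Retry a coroutine factory with exponential backoff, mirroring the
    runner's handling of transient API failures (3 attempts)."""
    for i in range(attempts):
        try:
            return await coro_fn()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** i))

async def run_batch(jobs, workers=2):
    """Run evaluation jobs concurrently, at most `workers` at a time
    (the role of the runner's `workers` parameter)."""
    sem = asyncio.Semaphore(workers)
    async def guarded(job):
        async with sem:
            return await _with_retry(job)
    return await asyncio.gather(*(guarded(j) for j in jobs))

# Demo: one flaky job that fails twice before succeeding, plus two
# trivial jobs; results come back in submission order.
calls = {"n": 0}
def make_flaky():
    async def job():
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("transient")
        return "ok"
    return job

jobs = [make_flaky()] + [lambda v=v: asyncio.sleep(0, result=v) for v in (1, 2)]
results = asyncio.run(run_batch(jobs, workers=2))
```

Raising `workers` widens the semaphore and speeds things up at the cost of more concurrent API calls in flight, which is exactly the rate-pressure trade-off noted above.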
Step 4: Run Evaluation
Execute evaluations using one of three methods: evaluate_response_strs() for pre-computed string responses, evaluate_responses() for Response objects with source nodes, or evaluate_queries() for end-to-end evaluation that runs queries through a query engine first.
Key considerations:
- evaluate_queries() is the most convenient for end-to-end testing
- evaluate_responses() preserves source node information for context-based metrics
- Per-evaluator kwargs can be passed for evaluators that need additional inputs
- All methods return Dict[str, List[EvaluationResult]]
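The fan-out and return shape of the batch methods can be sketched as follows. This toy analogue of `evaluate_response_strs()` assumes nothing about the library beyond the `Dict[str, List[EvaluationResult]]` return shape stated above; the `ExactMatchEvaluator` is a hypothetical stand-in judge.

```python
def evaluate_response_strs(evaluators, queries, responses, contexts_list):
    """Toy analogue of the batch method: apply every evaluator to every
    (query, response, contexts) triple. Returns results keyed by
    evaluator name, with one entry per query."""
    results = {name: [] for name in evaluators}
    for query, response, contexts in zip(queries, responses, contexts_list):
        for name, ev in evaluators.items():
            results[name].append(ev.evaluate(query, response, contexts))
    return results

class ExactMatchEvaluator:
    """Hypothetical stand-in judge: passes when the response equals the
    first retrieved context string."""
    def evaluate(self, query, response, contexts):
        ok = bool(contexts) and response == contexts[0]
        return {"score": 1.0 if ok else 0.0, "passing": ok,
                "feedback": "exact match" if ok else "mismatch"}

results = evaluate_response_strs(
    {"exact": ExactMatchEvaluator()},
    queries=["q1", "q2"],
    responses=["a", "b"],
    contexts_list=[["a"], ["c"]],
)
```

Each evaluator name maps to a list aligned with the query order, which is what makes per-query drill-down and cross-evaluator aggregation straightforward in the next step.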
Step 5: Analyze Results
Process the evaluation results to compute aggregate metrics and identify failure patterns. Each EvaluationResult contains a score, a passing boolean, and textual feedback explaining the judgment. Aggregate across queries to get overall pipeline quality scores.
Key considerations:
- Results are keyed by evaluator name with a list of results per query
- Scores are typically 0.0 to 1.0 or binary (pass/fail); CorrectnessEvaluator scores on a 1-5 scale
- Feedback text explains why a particular judgment was made
- Compare metrics across configurations to guide optimization decisions
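A small aggregation sketch over the `Dict[str, List[EvaluationResult]]` shape: compute pass rate and mean score per evaluator so configurations can be compared side by side. Results are represented as plain dicts here, and the numbers are made up for illustration.

```python
def summarize(results):
    """Aggregate per-evaluator results into pass rate and mean score,
    the headline numbers for comparing pipeline configurations."""
    summary = {}
    for name, items in results.items():
        n = len(items)
        summary[name] = {
            "pass_rate": sum(r["passing"] for r in items) / n,
            "mean_score": sum(r["score"] for r in items) / n,
        }
    return summary

# Example: faithfulness mostly passes while correctness lags, suggesting
# answers are grounded in context but drift from the reference answers.
results = {
    "faithfulness": [{"score": 1.0, "passing": True},
                     {"score": 1.0, "passing": True},
                     {"score": 0.5, "passing": False},
                     {"score": 1.0, "passing": True}],
    "correctness":  [{"score": 4.0, "passing": True},   # 1-5 scale
                     {"score": 2.0, "passing": False},
                     {"score": 3.0, "passing": False},
                     {"score": 4.5, "passing": True}],
}
summary = summarize(results)
```

Reading the two metrics together is what guides optimization: high faithfulness with low correctness points at generation or prompting, while low faithfulness points back at retrieval.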