Implementation:Run llama Llama index BatchEvalRunner Evaluate Queries
Overview
This page documents the three execution methods of the BatchEvalRunner class: evaluate_queries (end-to-end through a query engine), evaluate_responses (with pre-computed Response objects), and evaluate_response_strs (with raw strings and explicit contexts). Each method has an async counterpart (aevaluate_queries, aevaluate_responses, aevaluate_response_strs) that takes the same parameters; the examples below use the async variants. All three methods return a dictionary mapping evaluator names to lists of EvaluationResult objects.
Principle:Run_llama_Llama_index_Batch_Evaluation_Execution
RAG Evaluation Batch Processing LlamaIndex API
Source File
llama-index-core/llama_index/core/evaluation/batch_runner.py, Lines 350–443
Import Statement
from llama_index.core.evaluation import BatchEvalRunner
Method: evaluate_queries
Runs queries through a query engine and evaluates the responses with all registered evaluators.
Signature
| Parameter | Type | Default | Description |
|---|---|---|---|
| query_engine | BaseQueryEngine | required | The query engine to evaluate |
| queries | Optional[List[str]] | None | List of query strings to execute and evaluate |
| **eval_kwargs_lists | Dict[str, Any] | n/a | Additional keyword arguments passed to evaluators (e.g., reference answers for CorrectnessEvaluator) |
Return Type
Dict[str, List[EvaluationResult]] — Dictionary mapping evaluator names (from the runner's evaluators dict) to lists of EvaluationResult objects, one per query.
Example
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.llms.openai import OpenAI

# Build pipeline
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Configure evaluation
judge_llm = OpenAI(model="gpt-4", temperature=0.0)
runner = BatchEvalRunner(
    evaluators={
        "faithfulness": FaithfulnessEvaluator(llm=judge_llm),
        "relevancy": RelevancyEvaluator(llm=judge_llm),
    },
    workers=2,
    show_progress=True,
)

# Execute evaluation.
# Top-level await assumes a notebook or another async context; in a plain
# script, call the synchronous runner.evaluate_queries(...) instead.
eval_questions = [
    "What is the main topic of the document?",
    "What are the key findings?",
    "How was the study conducted?",
]
results = await runner.aevaluate_queries(
    query_engine=query_engine,
    queries=eval_questions,
)

# Access results by evaluator name
for question, faith_result in zip(eval_questions, results["faithfulness"]):
    print(f"Q: {question}")
    print(f"  Faithful: {faith_result.passing}")
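The workers=2 setting in the example bounds how many evaluation jobs run at the same time. The sketch below illustrates that idea with a plain asyncio.Semaphore. It is not the library's implementation: run_bounded and fake_eval are hypothetical names that stand in for the runner's scheduling and for a single evaluator call.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    jobs: List[Callable[[], Awaitable[str]]], workers: int
) -> List[str]:
    """Run async jobs with at most `workers` in flight, preserving input order."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        # Only `workers` coroutines can hold the semaphore at once.
        async with semaphore:
            return await job()

    return list(await asyncio.gather(*(guarded(job) for job in jobs)))


# Track peak concurrency to show that the cap is respected.
in_flight = 0
peak = 0


async def fake_eval(query: str) -> str:
    """Hypothetical stand-in for one evaluator call (no LLM involved)."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulate waiting on a remote judge
    in_flight -= 1
    return f"evaluated: {query}"


queries = ["q1", "q2", "q3", "q4", "q5"]
jobs = [lambda q=q: fake_eval(q) for q in queries]
outputs = asyncio.run(run_bounded(jobs, workers=2))
print(outputs, "peak concurrency:", peak)
```

Raising the cap shortens wall-clock time only until the judge LLM's rate limit becomes the bottleneck, which is why a small value is a reasonable starting point.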
Method: evaluate_responses
Evaluates pre-existing Response objects without re-running the query engine.
Signature
| Parameter | Type | Default | Description |
|---|---|---|---|
| queries | Optional[List[str]] | None | List of query strings corresponding to the responses |
| responses | Optional[List[Response]] | None | List of Response objects (including source nodes) to evaluate |
| **eval_kwargs_lists | Dict[str, Any] | n/a | Additional keyword arguments passed to evaluators |
Return Type
Dict[str, List[EvaluationResult]]
Example
# Reuses query_engine and runner from the evaluate_queries example.
# Collect responses first (e.g., from a batch run)
queries = ["What is RAG?", "How does indexing work?"]
responses = [query_engine.query(q) for q in queries]

# Evaluate the collected responses
results = await runner.aevaluate_responses(
    queries=queries,
    responses=responses,
)
for q, rel_result in zip(queries, results["relevancy"]):
    print(f"Q: {q} -> Relevant: {rel_result.passing}")
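Conceptually, evaluating a Response object means reading its answer text and the text of its source nodes, which are the same pieces the string-based method takes explicitly. A minimal sketch of that flattening follows. FakeNode, FakeResponse, and flatten_responses are hypothetical stand-ins written for illustration; the attribute names are assumptions and the real LlamaIndex classes are not used here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeNode:
    """Hypothetical stand-in for a retrieved source node."""
    text: str


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a query engine response."""
    response: str
    source_nodes: List[FakeNode] = field(default_factory=list)


def flatten_responses(
    responses: List[FakeResponse],
) -> Tuple[List[str], List[List[str]]]:
    """Split responses into answer strings and per-response context lists."""
    response_strs = [r.response for r in responses]
    contexts_list = [[node.text for node in r.source_nodes] for r in responses]
    return response_strs, contexts_list


sample = [
    FakeResponse("RAG combines retrieval with generation.",
                 [FakeNode("RAG retrieves documents."), FakeNode("Then it generates.")]),
    FakeResponse("Indexing stores embeddings.", [FakeNode("Indexes hold vectors.")]),
]
response_strs, contexts_list = flatten_responses(sample)
print(response_strs)
print(contexts_list)
```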
Method: evaluate_response_strs
Evaluates raw response strings with explicitly provided contexts. This is the lowest-level method.
Signature
| Parameter | Type | Default | Description |
|---|---|---|---|
| queries | Optional[List[str]] | None | List of query strings |
| response_strs | Optional[List[str]] | None | List of response strings (plain text) |
| contexts_list | Optional[List[List[str]]] | None | List of context lists, one per query (each is a list of context strings) |
| **eval_kwargs_lists | Dict[str, Any] | n/a | Additional keyword arguments passed to evaluators |
Return Type
Dict[str, List[EvaluationResult]]
Example
# Evaluate responses from an external system (reuses runner from above)
queries = ["What is machine learning?"]
response_strs = ["Machine learning is a subset of AI that learns from data."]
contexts_list = [
    [
        "Machine learning is a branch of artificial intelligence "
        "that focuses on building systems that learn from data."
    ]
]
results = await runner.aevaluate_response_strs(
    queries=queries,
    response_strs=response_strs,
    contexts_list=contexts_list,
)
# Expected to be True for this example, but the verdict depends on the judge LLM
print(results["faithfulness"][0].passing)
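Because this method takes three parallel lists, entry i of each list has to describe the same query. A small pre-flight check can catch misaligned inputs before any LLM calls are spent. check_aligned below is a hypothetical helper written for illustration, not part of LlamaIndex.

```python
from typing import List


def check_aligned(
    queries: List[str],
    response_strs: List[str],
    contexts_list: List[List[str]],
) -> int:
    """Return the number of examples, or raise if the lists are misaligned."""
    lengths = {
        "queries": len(queries),
        "response_strs": len(response_strs),
        "contexts_list": len(contexts_list),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Input lists differ in length: {lengths}")
    for i, contexts in enumerate(contexts_list):
        # Each entry must be a list of strings, not a single string.
        if isinstance(contexts, str):
            raise ValueError(f"contexts_list[{i}] must be a list of strings")
    return lengths["queries"]


n = check_aligned(
    ["What is machine learning?"],
    ["Machine learning is a subset of AI that learns from data."],
    [["Machine learning is a branch of artificial intelligence."]],
)
print(f"{n} example(s) ready for evaluation")
```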
Passing Reference Answers for Correctness
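The CorrectnessEvaluator compares each response against a reference answer. Reference answers are passed through eval_kwargs_lists as a list-valued keyword argument (reference) with one entry per query: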
from llama_index.core.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    CorrectnessEvaluator,
)
from llama_index.llms.openai import OpenAI

judge_llm = OpenAI(model="gpt-4", temperature=0.0)
runner = BatchEvalRunner(
    evaluators={
        "faithfulness": FaithfulnessEvaluator(llm=judge_llm),
        "correctness": CorrectnessEvaluator(llm=judge_llm),
    },
    workers=2,
)

queries = ["What is Python?", "What is JavaScript?"]
reference_answers = [
    "Python is a high-level programming language.",
    "JavaScript is a programming language for web development.",
]

# Pass reference answers for the correctness evaluator.
# query_engine is assumed to be built as in the evaluate_queries example.
results = await runner.aevaluate_queries(
    query_engine=query_engine,
    queries=queries,
    reference=reference_answers,  # Forwarded via **eval_kwargs_lists
)
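Each list-valued keyword argument is expected to line up with queries, so that query i is scored against entry i. The sketch below shows one way to picture that per-query slicing. kwargs_for_index is a hypothetical helper written for illustration and should not be read as the runner's actual code.

```python
from typing import Any, Dict, List


def kwargs_for_index(
    eval_kwargs_lists: Dict[str, List[Any]], idx: int
) -> Dict[str, Any]:
    """Pick the idx-th entry of every list-valued keyword argument."""
    return {name: values[idx] for name, values in eval_kwargs_lists.items()}


queries = ["What is Python?", "What is JavaScript?"]
eval_kwargs_lists = {
    "reference": [
        "Python is a high-level programming language.",
        "JavaScript is a programming language for web development.",
    ],
}

# One kwargs dict per query, in the same order as `queries`.
per_query = [kwargs_for_index(eval_kwargs_lists, i) for i in range(len(queries))]
for query, kwargs in zip(queries, per_query):
    print(query, "->", kwargs)
```

A list that is shorter than queries would raise an IndexError in this sketch, which is a reminder to keep every list the same length as the query list.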
Full End-to-End Evaluation Pipeline
import asyncio

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)
from llama_index.core.llama_dataset import RagDatasetGenerator
from llama_index.llms.openai import OpenAI


async def run_evaluation():
    # Step 1: Load documents and build index
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine()

    # Step 2: Generate evaluation dataset
    generator_llm = OpenAI(model="gpt-4", temperature=0.0)
    dataset_generator = RagDatasetGenerator.from_documents(
        documents=documents,
        llm=generator_llm,
        num_questions_per_chunk=3,
        workers=4,
    )
    # Use the async variant: we are already inside a running event loop
    rag_dataset = await dataset_generator.agenerate_dataset_from_nodes()
    eval_questions = [ex.query for ex in rag_dataset.examples]
    reference_answers = [ex.reference_answer for ex in rag_dataset.examples]

    # Step 3: Configure and run batch evaluation
    judge_llm = OpenAI(model="gpt-4", temperature=0.0)
    runner = BatchEvalRunner(
        evaluators={
            "faithfulness": FaithfulnessEvaluator(llm=judge_llm),
            "relevancy": RelevancyEvaluator(llm=judge_llm),
            "correctness": CorrectnessEvaluator(llm=judge_llm),
        },
        workers=2,
        show_progress=True,
    )
    results = await runner.aevaluate_queries(
        query_engine=query_engine,
        queries=eval_questions,
        reference=reference_answers,
    )

    # Step 4: Analyze results (pass rate per evaluator)
    for metric_name, metric_results in results.items():
        pass_count = sum(1 for r in metric_results if r.passing)
        total = len(metric_results)
        print(f"{metric_name}: {pass_count}/{total} passed "
              f"({pass_count/total*100:.1f}%)")
    return results


results = asyncio.run(run_evaluation())
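Beyond pass rates, it is often useful to see which queries failed and what the average score was. The sketch below summarizes a results dictionary of the shape described above. FakeResult is a hypothetical dataclass that mirrors only the passing, score, and feedback fields, summarize is a hypothetical helper, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in carrying a few EvaluationResult-like fields."""
    passing: Optional[bool]
    score: Optional[float] = None
    feedback: Optional[str] = None


def summarize(
    results: Dict[str, List[FakeResult]], queries: List[str]
) -> Dict[str, Dict[str, Any]]:
    """Compute pass rate, mean score, and failed queries per evaluator."""
    summary: Dict[str, Dict[str, Any]] = {}
    for name, metric_results in results.items():
        total = len(metric_results)
        passed = sum(1 for r in metric_results if r.passing)
        scores = [r.score for r in metric_results if r.score is not None]
        summary[name] = {
            # Guard against empty result lists to avoid division by zero.
            "pass_rate": passed / total if total else 0.0,
            "mean_score": sum(scores) / len(scores) if scores else None,
            "failed_queries": [
                q for q, r in zip(queries, metric_results) if not r.passing
            ],
        }
    return summary


sample_queries = ["What is RAG?", "How does indexing work?"]
sample_results = {
    "faithfulness": [
        FakeResult(passing=True, score=1.0),
        FakeResult(passing=False, score=0.0, feedback="unsupported claim"),
    ],
}
report = summarize(sample_results, sample_queries)
print(report)
```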
Knowledge Sources
LlamaIndex Evaluation LlamaIndex BatchEvalRunner
Environment:Run_llama_Llama_index_Python_LlamaIndex_Core Heuristic:Run_llama_Llama_index_Batch_Eval_Retry_Strategy Heuristic:Run_llama_Llama_index_Worker_Count_Configuration