Implementation:Run llama Llama index Eval Utils
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Utilities |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
Provides utility functions for running batch evaluations, aggregating results into DataFrames, uploading evaluation datasets and results to LlamaCloud, and parsing evaluation responses.
Description
The eval_utils.py module contains a collection of helper functions that support evaluation workflows in LlamaIndex. These are marked as beta functions and may change in future versions.
The module provides the following key functions:
- aget_responses / get_responses: aget_responses queries a BaseQueryEngine with a list of questions asynchronously, using asyncio.gather (with optional progress bar support) to run all queries concurrently and return one response per question. get_responses is the synchronous wrapper and accepts the same arguments.
- get_results_df: Aggregates multiple sets of evaluation results into a pandas.DataFrame. It takes a list of evaluation result dictionaries (each mapping metric names to lists of EvaluationResult objects), a list of experiment names (one per dictionary), and a list of metric keys. For each experiment it computes the mean score of every requested metric and returns a summary DataFrame.
- upload_eval_dataset: Uploads evaluation questions to LlamaCloud. It can accept either a direct list of questions or a llama_dataset_id to import from LlamaHub. It supports overwrite and append modes for existing datasets, and uses the llama_cloud client API for project and dataset management.
- upload_eval_results: Uploads evaluation results (a mapping of metric names to lists of EvaluationResult objects) to LlamaCloud under a specified project and app name.
- default_parser: A simple parser function that splits an evaluation response string on the first newline to extract a numeric score and the remaining text as reasoning. Returns (None, "No response") for empty strings.
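The concurrent query pattern described for aget_responses and get_responses can be sketched without LlamaIndex installed. This is a minimal illustration, not the module's code: EchoQueryEngine, gather_responses, and gather_responses_sync are hypothetical names, and the stub engine models only an aquery coroutine.

```python
import asyncio
from typing import List


class EchoQueryEngine:
    """Hypothetical stand-in for a BaseQueryEngine; only `aquery` is modeled."""

    async def aquery(self, question: str) -> str:
        # Simulate I/O latency so the coroutines actually overlap.
        await asyncio.sleep(0.01)
        return f"answer to: {question}"


async def gather_responses(questions: List[str], engine: EchoQueryEngine) -> List[str]:
    # Build one coroutine per question, then await them all concurrently.
    tasks = [engine.aquery(q) for q in questions]
    # asyncio.gather returns results in the order the awaitables were passed.
    return list(await asyncio.gather(*tasks))


def gather_responses_sync(questions: List[str], engine: EchoQueryEngine) -> List[str]:
    # Synchronous wrapper: run the coroutine to completion on a fresh event loop.
    return asyncio.run(gather_responses(questions, engine))


responses = gather_responses_sync(
    ["What is LlamaIndex?", "How do embeddings work?"],
    EchoQueryEngine(),
)
print(responses)
```

Because the wrapper starts its own event loop with asyncio.run, a sketch like this is meant for plain scripts; inside an already running loop the async variant would be awaited directly.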
Usage
Use these utilities when running batch evaluation workflows, especially when you need to evaluate a query engine against many questions, aggregate results across multiple experiments, or upload evaluation data to LlamaCloud for tracking and comparison.
Code Reference
Source Location
- Repository: Run_llama_Llama_index
- File: llama-index-core/llama_index/core/evaluation/eval_utils.py
Signature
async def aget_responses(
    questions: List[str],
    query_engine: BaseQueryEngine,
    show_progress: bool = False,
) -> List[str]

def get_responses(*args: Any, **kwargs: Any) -> List[str]

def get_results_df(
    eval_results_list: List[Dict[str, List[EvaluationResult]]],
    names: List[str],
    metric_keys: List[str],
) -> Any  # pandas.DataFrame

def upload_eval_dataset(
    dataset_name: str,
    questions: Optional[List[str]] = None,
    llama_dataset_id: Optional[str] = None,
    project_name: str = DEFAULT_PROJECT_NAME,
    base_url: Optional[str] = None,
    api_key: Optional[str] = None,
    overwrite: bool = False,
    append: bool = False,
) -> str

def upload_eval_results(
    project_name: str,
    app_name: str,
    results: Dict[str, List[EvaluationResult]],
) -> None

def default_parser(eval_response: str) -> Tuple[Optional[float], Optional[str]]
Import
from llama_index.core.evaluation.eval_utils import (
    aget_responses,
    get_responses,
    get_results_df,
    upload_eval_dataset,
    upload_eval_results,
    default_parser,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| questions | List[str] | Yes (aget_responses) | List of query strings to evaluate. |
| query_engine | BaseQueryEngine | Yes (aget_responses) | The query engine to evaluate. |
| show_progress | bool | No | Whether to show a progress bar. Defaults to False. |
| eval_results_list | List[Dict[str, List[EvaluationResult]]] | Yes (get_results_df) | List of evaluation result dictionaries to aggregate. |
| names | List[str] | Yes (get_results_df) | Labels for each set of evaluation results. |
| metric_keys | List[str] | Yes (get_results_df) | Metric names to include in the DataFrame. |
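| dataset_name | str | Yes (upload_eval_dataset) | Name for the evaluation dataset on LlamaCloud. |
| questions | Optional[List[str]] | No | Questions to upload directly (upload_eval_dataset). Defaults to None. |
| llama_dataset_id | Optional[str] | No | ID of a LlamaHub dataset to import questions from (upload_eval_dataset). Defaults to None. |
| project_name | str | Yes (upload_eval_results); No (upload_eval_dataset) | LlamaCloud project name. Defaults to DEFAULT_PROJECT_NAME in upload_eval_dataset. |
| base_url | Optional[str] | No | Base URL for the LlamaCloud client (upload_eval_dataset). Defaults to None. |
| api_key | Optional[str] | No | API key for the LlamaCloud client (upload_eval_dataset). Defaults to None. |
| overwrite | bool | No | Whether to overwrite an existing dataset (upload_eval_dataset). Defaults to False. |
| append | bool | No | Whether to append to an existing dataset (upload_eval_dataset). Defaults to False. |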
| app_name | str | Yes (upload_eval_results) | LlamaCloud app name for grouping results. |
| results | Dict[str, List[EvaluationResult]] | Yes (upload_eval_results) | Mapping of metric names to evaluation result lists. |
| eval_response | str | Yes (default_parser) | Raw string output from an evaluation LLM call. |
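The shape of the get_results_df inputs is easier to see with concrete values. The sketch below mirrors the aggregation using only the standard library. FakeResult and summarize are hypothetical names, only the score attribute of EvaluationResult is modeled, and treating a missing score as 0.0 is an assumption of this sketch rather than a statement about the library.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Dict, List, Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for EvaluationResult; only `score` is modeled."""

    score: Optional[float]


def summarize(
    eval_results_list: List[Dict[str, List[FakeResult]]],
    names: List[str],
    metric_keys: List[str],
) -> Dict[str, list]:
    # One "names" column plus one column per metric; one entry per experiment.
    table: Dict[str, list] = {"names": list(names)}
    for key in metric_keys:
        table[key] = [
            # Assumption: a missing (None) score counts as 0.0 in the mean.
            fmean([r.score or 0.0 for r in results[key]])
            for results in eval_results_list
        ]
    return table


baseline = {
    "answer_relevancy": [FakeResult(1.0), FakeResult(0.0)],
    "context_relevancy": [FakeResult(0.5), FakeResult(None)],
}
improved = {
    "answer_relevancy": [FakeResult(1.0), FakeResult(1.0)],
    "context_relevancy": [FakeResult(1.0), FakeResult(0.5)],
}

table = summarize(
    [baseline, improved],
    names=["Baseline", "Improved"],
    metric_keys=["answer_relevancy", "context_relevancy"],
)
print(table)
```

A column-oriented dictionary like table can be passed to pandas.DataFrame(table) to obtain a frame with one row per experiment.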
Outputs
| Name | Type | Description |
|---|---|---|
| responses | List[str] | Query engine responses, one per question (from aget_responses/get_responses). The annotation is List[str]; the elements are whatever query_engine.aquery returns. |
| df | pandas.DataFrame | Summary DataFrame with mean scores per metric per experiment (from get_results_df). |
| dataset_id | str | The ID of the created/updated evaluation dataset on LlamaCloud (from upload_eval_dataset). |
| parsed_result | Tuple[Optional[float], Optional[str]] | A (score, reasoning) tuple parsed from the evaluation response (from default_parser). |
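The (score, reasoning) contract of default_parser can be illustrated with a small standalone function. This is a sketch of the described behavior, not the library's code: parse_score_and_reasoning is a hypothetical name, and returning a None score for a non-numeric first line is an assumption.

```python
from typing import Optional, Tuple


def parse_score_and_reasoning(eval_response: str) -> Tuple[Optional[float], Optional[str]]:
    # Empty or whitespace-only input has nothing to parse.
    if not eval_response.strip():
        return None, "No response"
    # partition splits on the first newline only; the rest is the reasoning.
    score_line, _, reasoning = eval_response.partition("\n")
    try:
        score: Optional[float] = float(score_line)
    except ValueError:
        # Assumption: a non-numeric first line yields no score instead of an error.
        score = None
    # Drop blank lines between the score line and the reasoning text.
    return score, reasoning.lstrip("\n")


print(parse_score_and_reasoning("4.5\nThe answer is grounded in the context."))
print(parse_score_and_reasoning(""))
```

The sketch uses str.partition so that a response with no newline still returns a tuple; the library function may handle that case differently.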
Usage Examples
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
    upload_eval_results,
)

# This snippet assumes `query_engine` (a BaseQueryEngine) and the evaluation
# results `baseline_results`, `improved_results`, and `eval_result_list`
# have already been created elsewhere.

# Get responses from a query engine
questions = ["What is LlamaIndex?", "How do embeddings work?"]
responses = get_responses(questions, query_engine, show_progress=True)

# Aggregate evaluation results into a DataFrame
df = get_results_df(
    eval_results_list=[baseline_results, improved_results],
    names=["Baseline", "Improved"],
    metric_keys=["answer_relevancy", "context_relevancy"],
)
print(df)

# Upload results to LlamaCloud
upload_eval_results(
    project_name="my_project",
    app_name="my_app",
    results={"answer_relevancy": eval_result_list},
)