Implementation:Run llama Llama index Eval Utils

From Leeroopedia
Knowledge Sources
Domains Evaluation, Utilities
Last Updated 2026-02-11 19:00 GMT

Overview

Provides utility functions for running batch evaluations, aggregating results into DataFrames, uploading evaluation datasets and results to LlamaCloud, and parsing evaluation responses.

Description

The eval_utils.py module contains a collection of helper functions that support evaluation workflows in LlamaIndex. These are marked as beta functions and may change in future versions.

The module provides the following key functions:

  • aget_responses / get_responses: Query a BaseQueryEngine with a list of questions. aget_responses is a coroutine that runs all queries concurrently via asyncio.gather, with optional progress bar support, and returns the list of responses. get_responses is a synchronous wrapper that forwards its arguments to aget_responses.
  • get_results_df: Aggregates multiple sets of evaluation results into a pandas.DataFrame. It takes a list of evaluation result dictionaries (each mapping metric names to lists of EvaluationResult objects), a list of experiment names (one per dictionary), and a list of metric keys. For each experiment it computes the mean score of every requested metric and returns a summary DataFrame.
  • upload_eval_dataset: Uploads evaluation questions to LlamaCloud. It can accept either a direct list of questions or a llama_dataset_id to import from LlamaHub. It supports overwrite and append modes for existing datasets, and uses the llama_cloud client API for project and dataset management.
  • upload_eval_results: Uploads evaluation results (a mapping of metric names to lists of EvaluationResult objects) to LlamaCloud under a specified project and app name.
  • default_parser: A simple parser function that splits an evaluation response string on the first newline to extract a numeric score and the remaining text as reasoning. Returns (None, "No response") for empty strings.
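The concurrent querying described for aget_responses / get_responses can be sketched as follows. This is a minimal illustration of the asyncio.gather pattern, not the module's actual code: EchoEngine, gather_responses, and gather_responses_sync are hypothetical names, and the sketch assumes the engine exposes an async aquery method.

```python
import asyncio
from typing import List


class EchoEngine:
    """Hypothetical stand-in for a query engine; not part of LlamaIndex."""

    async def aquery(self, question: str) -> str:
        # Simulate I/O latency so the queries genuinely overlap.
        await asyncio.sleep(0.01)
        return f"answer to: {question}"


async def gather_responses(questions: List[str], engine: EchoEngine) -> List[str]:
    # Create one coroutine per question, then await them all together.
    tasks = [engine.aquery(q) for q in questions]
    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*tasks)


def gather_responses_sync(questions: List[str], engine: EchoEngine) -> List[str]:
    # Synchronous wrapper: runs the coroutine to completion on a new event loop.
    return asyncio.run(gather_responses(questions, engine))


responses = gather_responses_sync(["q1", "q2"], EchoEngine())
```

Because asyncio.gather preserves input order, each response lines up with the question at the same index. Note that asyncio.run cannot be called from inside an already running event loop, so in that situation the async variant would be awaited directly.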

Usage

Use these utilities when running batch evaluation workflows, especially when you need to evaluate a query engine against many questions, aggregate results across multiple experiments, or upload evaluation data to LlamaCloud for tracking and comparison.
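To make the aggregation step concrete, the sketch below computes a per-experiment mean for each metric using only the standard library. It is an illustration under stated assumptions rather than the library's implementation: FakeResult and summarize are hypothetical names, and treating a missing score as 0.0 is an assumption of this sketch.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for EvaluationResult; only the score is modelled."""

    score: Optional[float]


def summarize(
    eval_results_list: List[Dict[str, List[FakeResult]]],
    names: List[str],
    metric_keys: List[str],
) -> Dict[str, list]:
    # Column-oriented summary: a "names" column plus one column per metric.
    summary: Dict[str, list] = {"names": list(names)}
    for key in metric_keys:
        summary[key] = [
            # Assumption: a missing score (None) counts as 0.0.
            mean((r.score or 0.0) for r in results[key])
            for results in eval_results_list
        ]
    return summary


baseline = {"relevancy": [FakeResult(1.0), FakeResult(0.0)]}
improved = {"relevancy": [FakeResult(1.0), FakeResult(1.0)]}
summary = summarize([baseline, improved], ["Baseline", "Improved"], ["relevancy"])
```

A dictionary of equal-length columns like this one can be passed to the pandas.DataFrame constructor to obtain a table with one row per experiment.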

Code Reference

Source Location

Signature

async def aget_responses(
    questions: List[str],
    query_engine: BaseQueryEngine,
    show_progress: bool = False,
) -> List[str]

def get_responses(*args: Any, **kwargs: Any) -> List[str]

def get_results_df(
    eval_results_list: List[Dict[str, List[EvaluationResult]]],
    names: List[str],
    metric_keys: List[str],
) -> Any  # pandas.DataFrame

def upload_eval_dataset(
    dataset_name: str,
    questions: Optional[List[str]] = None,
    llama_dataset_id: Optional[str] = None,
    project_name: str = DEFAULT_PROJECT_NAME,
    base_url: Optional[str] = None,
    api_key: Optional[str] = None,
    overwrite: bool = False,
    append: bool = False,
) -> str

def upload_eval_results(
    project_name: str,
    app_name: str,
    results: Dict[str, List[EvaluationResult]],
) -> None

def default_parser(eval_response: str) -> Tuple[Optional[float], Optional[str]]

Import

from llama_index.core.evaluation.eval_utils import (
    aget_responses,
    get_responses,
    get_results_df,
    upload_eval_dataset,
    upload_eval_results,
    default_parser,
)

I/O Contract

Inputs

Name | Type | Required | Description
questions | List[str] | Yes (aget_responses); No (upload_eval_dataset) | List of query strings to evaluate or upload.
query_engine | BaseQueryEngine | Yes (aget_responses) | The query engine to evaluate.
show_progress | bool | No | Whether to show a progress bar. Defaults to False.
eval_results_list | List[Dict[str, List[EvaluationResult]]] | Yes (get_results_df) | List of evaluation result dictionaries to aggregate.
names | List[str] | Yes (get_results_df) | Labels for each set of evaluation results.
metric_keys | List[str] | Yes (get_results_df) | Metric names to include in the DataFrame.
dataset_name | str | Yes (upload_eval_dataset) | Name for the evaluation dataset on LlamaCloud.
project_name | str | Yes (upload_eval_results); No (upload_eval_dataset) | LlamaCloud project name. Defaults to DEFAULT_PROJECT_NAME in upload_eval_dataset.
app_name | str | Yes (upload_eval_results) | LlamaCloud app name for grouping results.
results | Dict[str, List[EvaluationResult]] | Yes (upload_eval_results) | Mapping of metric names to evaluation result lists.
eval_response | str | Yes (default_parser) | Raw string output from an evaluation LLM call.

Outputs

Name | Type | Description
responses | List[str] | Query engine responses for each question (from aget_responses/get_responses).
df | pandas.DataFrame | Summary DataFrame with mean scores per metric per experiment (from get_results_df).
dataset_id | str | The ID of the created/updated evaluation dataset on LlamaCloud (from upload_eval_dataset).
parsed_result | Tuple[Optional[float], Optional[str]] | A (score, reasoning) tuple parsed from the evaluation response (from default_parser).
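The (score, reasoning) contract of default_parser can be illustrated with a small standalone sketch. parse_score_and_reasoning is a hypothetical re-implementation based on the description above, not the library function itself. Its handling of a non-numeric first line, and its use of str.partition so that a response without a newline does not raise, are assumptions of this sketch.

```python
from typing import Optional, Tuple


def parse_score_and_reasoning(
    eval_response: str,
) -> Tuple[Optional[float], Optional[str]]:
    # Empty or whitespace-only input: no score, placeholder reasoning.
    if not eval_response.strip():
        return None, "No response"
    # Split on the first newline only; later newlines stay in the reasoning.
    score_str, _, reasoning = eval_response.partition("\n")
    try:
        score = float(score_str)
    except ValueError:
        # Assumption: a non-numeric first line yields no score.
        score = None
    return score, reasoning.lstrip("\n")


parsed = parse_score_and_reasoning("4.5\nGood answer.\nCites the context.")
empty = parse_score_and_reasoning("")
```

Here parsed holds the score from the first line and the untouched remainder as reasoning, while empty holds the placeholder tuple described for empty strings.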

Usage Examples

from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
    upload_eval_results,
)

# Get responses from a query engine.
# query_engine is assumed to be a BaseQueryEngine instance created elsewhere.
questions = ["What is LlamaIndex?", "How do embeddings work?"]
responses = get_responses(questions, query_engine, show_progress=True)

# Aggregate evaluation results into a DataFrame.
# baseline_results and improved_results are assumed to be
# Dict[str, List[EvaluationResult]] objects keyed by metric name.
df = get_results_df(
    eval_results_list=[baseline_results, improved_results],
    names=["Baseline", "Improved"],
    metric_keys=["answer_relevancy", "context_relevancy"],
)
print(df)

# Upload results to LlamaCloud.
# eval_result_list is assumed to be a List[EvaluationResult].
upload_eval_results(
    project_name="my_project",
    app_name="my_app",
    results={"answer_relevancy": eval_result_list},
)
