
Implementation:Confident AI DeepEval Evaluate Function

From Leeroopedia

Overview

evaluate is the primary batch evaluation function in the deepeval library. It executes one or more evaluation metrics across a collection of test cases, producing an EvaluationResult object containing per-case scores, aggregate statistics, and an optional link to the Confident AI cloud dashboard.

This is an API documentation page.

Source

Import

from deepeval import evaluate

Function Signature

def evaluate(
    test_cases: Union[List[LLMTestCase], List[ConversationalTestCase]],
    metrics: Optional[List[BaseMetric]] = None,
    metric_collection: Optional[MetricCollection] = None,
    hyperparameters: Optional[Dict[str, Union[str, int, float]]] = None,
    identifier: Optional[str] = None,
    async_config: Optional[AsyncConfig] = None,
    display_config: Optional[DisplayConfig] = None,
    cache_config: Optional[CacheConfig] = None,
    error_config: Optional[ErrorConfig] = None
) -> EvaluationResult

Parameters

  • test_cases (Union[List[LLMTestCase], List[ConversationalTestCase]], required) -- The list of test cases to evaluate. Each test case contains the input, actual output, and any additional context required by the metrics.
  • metrics (Optional[List[BaseMetric]], default None) -- The evaluation metrics to apply to each test case. At least one of metrics or metric_collection must be provided.
  • metric_collection (Optional[MetricCollection], default None) -- An alternative way to specify metrics as a named collection. Useful for reusable metric configurations.
  • hyperparameters (Optional[Dict[str, Union[str, int, float]]], default None) -- Key-value pairs of hyperparameters to associate with this evaluation run (e.g., model version, temperature, prompt template ID). Used for experiment tracking on the Confident AI dashboard.
  • identifier (Optional[str], default None) -- A unique identifier for this evaluation run. Useful for referencing specific runs in reports and dashboards.
  • async_config (Optional[AsyncConfig], default None) -- Configuration for asynchronous evaluation, including concurrency limits and timeout settings.
  • display_config (Optional[DisplayConfig], default None) -- Configuration for how results are displayed, including verbose mode and progress indicators.
  • cache_config (Optional[CacheConfig], default None) -- Configuration for caching evaluation results to avoid redundant LLM calls.
  • error_config (Optional[ErrorConfig], default None) -- Configuration for error-handling behavior (e.g., skip on error, retry, fail-fast).

Input / Output

  • Inputs: A list of test cases and a list of metrics (or metric collection) to apply. Optional configuration objects control async behavior, display, caching, and error handling.
  • Outputs: An EvaluationResult object containing:
    • test_results -- A list of TestResult objects, one per test case, each containing per-metric scores and metadata.
    • confident_link -- An optional URL to the evaluation results on the Confident AI cloud dashboard (if authenticated).
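The return shape described above can be mimicked with plain dataclasses. These are hypothetical stand-ins for deepeval's actual TestResult/EvaluationResult classes (field names simplified), used only to show how per-case, per-metric scores roll up into an overall pass rate:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MetricData:            # stand-in for a per-metric score record
    name: str
    score: float
    threshold: float

    @property
    def success(self) -> bool:
        # A metric passes when its score meets its threshold.
        return self.score >= self.threshold

@dataclass
class TestResult:            # stand-in: one entry per test case
    metrics_data: List[MetricData]

@dataclass
class EvaluationResult:      # stand-in for the real return type
    test_results: List[TestResult]
    confident_link: Optional[str] = None

# Roll up a pass rate across all cases and metrics.
result = EvaluationResult(test_results=[
    TestResult([MetricData("Answer Relevancy", 0.91, 0.7)]),
    TestResult([MetricData("Answer Relevancy", 0.42, 0.7)]),
])
passed = sum(m.success for tr in result.test_results for m in tr.metrics_data)
total = sum(len(tr.metrics_data) for tr in result.test_results)
print(f"pass rate: {passed}/{total}")  # pass rate: 1/2
```

The same iteration pattern (loop over test_results, then over each result's metrics_data) applies to the real object returned by evaluate.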

Example

Basic Batch Evaluation

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval.evaluate import DisplayConfig

# Define metrics
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
faithfulness_metric = FaithfulnessMetric(threshold=0.7)

# Create test cases
test_cases = [
    LLMTestCase(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of AI that enables systems to learn from data.",
        retrieval_context=["Machine learning is a branch of artificial intelligence focused on building systems that learn from data."]
    ),
    LLMTestCase(
        input="Explain neural networks.",
        actual_output="Neural networks are computing systems inspired by biological neural networks.",
        retrieval_context=["Artificial neural networks are computing systems inspired by the biological neural networks in animal brains."]
    )
]

# Run batch evaluation
result = evaluate(
    test_cases=test_cases,
    metrics=[relevancy_metric, faithfulness_metric],
    display_config=DisplayConfig(verbose_mode=True)
)

# Access results
for test_result in result.test_results:
    print(test_result.metrics_data)

Evaluation with Hyperparameters

from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

metric = GEval(
    name="Coherence",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    criteria="Evaluate the coherence of the output",
    threshold=0.5
)

# Define the test case referenced below (illustrative content)
test_case = LLMTestCase(
    input="Summarize the meeting notes.",
    actual_output="The team agreed to ship the release on Friday."
)

result = evaluate(
    test_cases=[test_case],
    metrics=[metric],
    hyperparameters={
        "model": "gpt-4o",
        "temperature": 0.7,
        "prompt_version": "v2.1"
    }
)
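Because hyperparameters are logged per run, they make runs comparable. The sketch below is a hypothetical, local illustration of that experiment-tracking idea (the run records and scores are invented; on Confident AI this comparison happens on the dashboard):

```python
# Each record pairs a run identifier with its logged hyperparameters
# and an aggregate score (all values here are illustrative).
runs = [
    {"identifier": "run-a",
     "hyperparameters": {"model": "gpt-4o", "temperature": 0.7},
     "mean_score": 0.81},
    {"identifier": "run-b",
     "hyperparameters": {"model": "gpt-4o", "temperature": 0.2},
     "mean_score": 0.88},
]

# Pick the best-scoring run and read off the setting that produced it.
best = max(runs, key=lambda r: r["mean_score"])
print(best["identifier"], best["hyperparameters"]["temperature"])  # run-b 0.2
```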
