
Implementation:Confident AI DeepEval Evaluate Function

From Leeroopedia

Overview

evaluate is the primary batch evaluation function in the deepeval library. It executes one or more evaluation metrics across a collection of test cases, producing an EvaluationResult object containing per-case scores, aggregate statistics, and an optional link to the Confident AI cloud dashboard.

This is an API documentation page.

Source

Import

from deepeval import evaluate

Function Signature

def evaluate(
    test_cases: Union[List[LLMTestCase], List[ConversationalTestCase]],
    metrics: Optional[List[BaseMetric]] = None,
    metric_collection: Optional[MetricCollection] = None,
    hyperparameters: Optional[Dict[str, Union[str, int, float]]] = None,
    identifier: Optional[str] = None,
    async_config: Optional[AsyncConfig] = None,
    display_config: Optional[DisplayConfig] = None,
    cache_config: Optional[CacheConfig] = None,
    error_config: Optional[ErrorConfig] = None
) -> EvaluationResult

Parameters

  • test_cases (Union[List[LLMTestCase], List[ConversationalTestCase]], required) -- The list of test cases to evaluate. Each test case contains the input, actual output, and any additional context required by the metrics.
  • metrics (Optional[List[BaseMetric]], default None) -- The evaluation metrics to apply to each test case. At least one of metrics or metric_collection must be provided.
  • metric_collection (Optional[MetricCollection], default None) -- An alternative way to specify metrics as a named collection. Useful for reusable metric configurations.
  • hyperparameters (Optional[Dict[str, Union[str, int, float]]], default None) -- Key-value pairs of hyperparameters to associate with this evaluation run (e.g., model version, temperature, prompt template ID). Used for experiment tracking on the Confident AI dashboard.
  • identifier (Optional[str], default None) -- A unique identifier for this evaluation run. Useful for referencing specific runs in reports and dashboards.
  • async_config (Optional[AsyncConfig], default None) -- Configuration for asynchronous evaluation, including concurrency limits and timeout settings.
  • display_config (Optional[DisplayConfig], default None) -- Configuration for how results are displayed, including verbose mode and progress indicators.
  • cache_config (Optional[CacheConfig], default None) -- Configuration for caching evaluation results to avoid redundant LLM calls.
  • error_config (Optional[ErrorConfig], default None) -- Configuration for error-handling behavior (e.g., skip on error, retry, fail-fast).

Input / Output

  • Inputs: A list of test cases and a list of metrics (or metric collection) to apply. Optional configuration objects control async behavior, display, caching, and error handling.
  • Outputs: An EvaluationResult object containing:
    • test_results -- A list of TestResult objects, one per test case, each containing per-metric scores and metadata.
    • confident_link -- An optional URL to the evaluation results on the Confident AI cloud dashboard (if authenticated).
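The return shape described above can be mimicked with plain dataclasses. These are hypothetical stand-ins for deepeval's actual TestResult/EvaluationResult classes (field names simplified), used only to show how per-case, per-metric scores roll up into an overall pass rate:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MetricData:            # stand-in for a per-metric score record
    name: str
    score: float
    threshold: float

    @property
    def success(self) -> bool:
        # A metric passes when its score meets its threshold.
        return self.score >= self.threshold

@dataclass
class TestResult:            # stand-in: one entry per test case
    metrics_data: List[MetricData]

@dataclass
class EvaluationResult:      # stand-in for the real return type
    test_results: List[TestResult]
    confident_link: Optional[str] = None

# Roll up a pass rate across all cases and metrics.
result = EvaluationResult(test_results=[
    TestResult([MetricData("Answer Relevancy", 0.91, 0.7)]),
    TestResult([MetricData("Answer Relevancy", 0.42, 0.7)]),
])
passed = sum(m.success for tr in result.test_results for m in tr.metrics_data)
total = sum(len(tr.metrics_data) for tr in result.test_results)
print(f"pass rate: {passed}/{total}")  # pass rate: 1/2
```

The same iteration pattern (loop over test_results, then over each result's metrics_data) applies to the real object returned by evaluate.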

Example

Basic Batch Evaluation

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval.evaluate import DisplayConfig

# Define metrics
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
faithfulness_metric = FaithfulnessMetric(threshold=0.7)

# Create test cases
test_cases = [
    LLMTestCase(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of AI that enables systems to learn from data.",
        retrieval_context=["Machine learning is a branch of artificial intelligence focused on building systems that learn from data."]
    ),
    LLMTestCase(
        input="Explain neural networks.",
        actual_output="Neural networks are computing systems inspired by biological neural networks.",
        retrieval_context=["Artificial neural networks are computing systems inspired by the biological neural networks in animal brains."]
    )
]

# Run batch evaluation
result = evaluate(
    test_cases=test_cases,
    metrics=[relevancy_metric, faithfulness_metric],
    display_config=DisplayConfig(verbose_mode=True)
)

# Access results
for test_result in result.test_results:
    print(test_result.metrics_data)

Evaluation with Hyperparameters

from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

metric = GEval(
    name="Coherence",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    criteria="Evaluate the coherence of the output",
    threshold=0.5
)

# Define the test case referenced below (illustrative content)
test_case = LLMTestCase(
    input="Summarize the meeting notes.",
    actual_output="The team agreed to ship the release on Friday."
)

result = evaluate(
    test_cases=[test_case],
    metrics=[metric],
    hyperparameters={
        "model": "gpt-4o",
        "temperature": 0.7,
        "prompt_version": "v2.1"
    }
)
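Because hyperparameters are logged per run, they make runs comparable. The sketch below is a hypothetical, local illustration of that experiment-tracking idea (the run records and scores are invented; on Confident AI this comparison happens on the dashboard):

```python
# Each record pairs a run identifier with its logged hyperparameters
# and an aggregate score (all values here are illustrative).
runs = [
    {"identifier": "run-a",
     "hyperparameters": {"model": "gpt-4o", "temperature": 0.7},
     "mean_score": 0.81},
    {"identifier": "run-b",
     "hyperparameters": {"model": "gpt-4o", "temperature": 0.2},
     "mean_score": 0.88},
]

# Pick the best-scoring run and read off the setting that produced it.
best = max(runs, key=lambda r: r["mean_score"])
print(best["identifier"], best["hyperparameters"]["temperature"])  # run-b 0.2
```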
