Implementation:Confident AI Deepeval Evaluate Function
Overview
evaluate is the primary batch evaluation function in the deepeval library. It executes one or more evaluation metrics across a collection of test cases, producing an EvaluationResult object containing per-case scores, aggregate statistics, and an optional link to the Confident AI cloud dashboard.
This is an API Doc implementation.
Source
- Repository: Confident AI Deepeval
- File: deepeval/evaluate/evaluate.py, lines 185-323
- Function: evaluate
Import
from deepeval import evaluate
Function Signature
def evaluate(
    test_cases: Union[List[LLMTestCase], List[ConversationalTestCase]],
    metrics: Optional[List[BaseMetric]] = None,
    metric_collection: Optional[MetricCollection] = None,
    hyperparameters: Optional[Dict[str, Union[str, int, float]]] = None,
    identifier: Optional[str] = None,
    async_config: Optional[AsyncConfig] = None,
    display_config: Optional[DisplayConfig] = None,
    cache_config: Optional[CacheConfig] = None,
    error_config: Optional[ErrorConfig] = None
) -> EvaluationResult
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| test_cases | Union[List[LLMTestCase], List[ConversationalTestCase]] | Yes | -- | The list of test cases to evaluate. Each test case contains the input, actual output, and any additional context required by the metrics. |
| metrics | Optional[List[BaseMetric]] | No | None | The list of evaluation metrics to apply to each test case. At least one of metrics or metric_collection must be provided. |
| metric_collection | Optional[MetricCollection] | No | None | An alternative way to specify metrics as a named collection. Useful for reusable metric configurations. |
| hyperparameters | Optional[Dict[str, Union[str, int, float]]] | No | None | Key-value pairs of hyperparameters to associate with this evaluation run (e.g., model version, temperature, prompt template ID). Used for experiment tracking on the Confident AI dashboard. |
| identifier | Optional[str] | No | None | A unique identifier for this evaluation run. Useful for referencing specific runs in reports and dashboards. |
| async_config | Optional[AsyncConfig] | No | None | Configuration for asynchronous evaluation, including concurrency limits and timeout settings. |
| display_config | Optional[DisplayConfig] | No | None | Configuration for how results are displayed, including verbose mode and progress indicators. |
| cache_config | Optional[CacheConfig] | No | None | Configuration for caching evaluation results to avoid redundant LLM calls. |
| error_config | Optional[ErrorConfig] | No | None | Configuration for error handling behavior (e.g., skip on error, retry, fail-fast). |
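The table above notes that at least one of metrics or metric_collection must be provided. A minimal standalone guard illustrating that rule (this is an illustration of the documented constraint, not the library's actual validation code):

```python
from typing import List, Optional


def check_metrics_provided(
    metrics: Optional[List[object]] = None,
    metric_collection: Optional[object] = None,
) -> None:
    """Illustrative stand-in for the validation `evaluate` performs:
    raise if neither `metrics` nor `metric_collection` is supplied."""
    if not metrics and metric_collection is None:
        raise ValueError(
            "Either `metrics` or `metric_collection` must be provided."
        )


# A non-empty metrics list passes the check silently
check_metrics_provided(metrics=[object()])
```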
Input / Output
- Inputs: A list of test cases and a list of metrics (or metric collection) to apply. Optional configuration objects control async behavior, display, caching, and error handling.
- Outputs: An EvaluationResult object containing:
  - test_results -- A list of TestResult objects, one per test case, each containing per-metric scores and metadata.
  - confident_link -- An optional URL to the evaluation results on the Confident AI cloud dashboard (if authenticated).
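Once evaluate returns, the per-case results can be post-processed in plain Python. The sketch below aggregates scores from an EvaluationResult-like structure; plain dicts stand in for TestResult objects and their metric data, and the field names name, score, and success are assumptions made here for illustration:

```python
from collections import defaultdict


def summarize(test_results):
    """Return ({metric_name: average_score}, overall_pass_rate) for a
    list of TestResult-like dicts. A case passes only if every metric
    on it succeeded."""
    scores = defaultdict(list)
    passed = 0
    for tr in test_results:
        if all(m["success"] for m in tr["metrics_data"]):
            passed += 1
        for m in tr["metrics_data"]:
            scores[m["name"]].append(m["score"])
    averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return averages, passed / len(test_results)


# Hypothetical results standing in for evaluate()'s output
fake_results = [
    {"metrics_data": [{"name": "Answer Relevancy", "score": 0.9, "success": True}]},
    {"metrics_data": [{"name": "Answer Relevancy", "score": 0.5, "success": False}]},
]
averages, pass_rate = summarize(fake_results)
```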
Example
Basic Batch Evaluation
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval.evaluate import DisplayConfig
# Define metrics
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
faithfulness_metric = FaithfulnessMetric(threshold=0.7)
# Create test cases
test_cases = [
    LLMTestCase(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of AI that enables systems to learn from data.",
        retrieval_context=["Machine learning is a branch of artificial intelligence focused on building systems that learn from data."]
    ),
    LLMTestCase(
        input="Explain neural networks.",
        actual_output="Neural networks are computing systems inspired by biological neural networks.",
        retrieval_context=["Artificial neural networks are computing systems inspired by the biological neural networks in animal brains."]
    )
]

# Run batch evaluation
result = evaluate(
    test_cases=test_cases,
    metrics=[relevancy_metric, faithfulness_metric],
    display_config=DisplayConfig(verbose_mode=True)
)

# Access results
for test_result in result.test_results:
    print(test_result.metrics_data)
Evaluation with Hyperparameters
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
metric = GEval(
    name="Coherence",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    criteria="Evaluate the coherence of the output",
    threshold=0.5
)

# Define a test case to evaluate
test_case = LLMTestCase(
    input="What is machine learning?",
    actual_output="Machine learning is a subset of AI that enables systems to learn from data."
)

result = evaluate(
    test_cases=[test_case],
    metrics=[metric],
    hyperparameters={
        "model": "gpt-4o",
        "temperature": 0.7,
        "prompt_version": "v2.1"
    }
)
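The optional config objects described above can be combined in one call. The sketch below (a configuration fragment, reusing the test cases and relevancy metric from the first example) shows one plausible combination; the exact field names such as run_async, max_concurrent, write_cache, use_cache, and ignore_errors are assumptions based on recent deepeval releases, so verify them against the signatures in your installed version:

```python
from deepeval import evaluate
from deepeval.evaluate import AsyncConfig, CacheConfig, DisplayConfig, ErrorConfig

# Field names below are assumptions; check your installed deepeval version.
result = evaluate(
    test_cases=test_cases,
    metrics=[relevancy_metric],
    async_config=AsyncConfig(run_async=True, max_concurrent=10),  # cap parallel LLM calls
    display_config=DisplayConfig(verbose_mode=False),             # quiet output
    cache_config=CacheConfig(write_cache=True, use_cache=False),  # store but don't reuse results
    error_config=ErrorConfig(ignore_errors=True),                 # continue past metric failures
)
```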
Related Pages
- Environment:Confident_ai_Deepeval_Python_3_9_Runtime
- Environment:Confident_ai_Deepeval_LLM_Provider_Credentials
- Heuristic:Confident_ai_Deepeval_Timeout_and_Retry_Tuning
- Heuristic:Confident_ai_Deepeval_Async_Concurrency_Tuning