Implementation:Confident ai Deepeval StepEfficiencyMetric
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-14 09:00 GMT |
Overview
Concrete evaluation metric class that measures whether an AI agent completes tasks with minimal unnecessary steps. The StepEfficiencyMetric analyzes the agent's execution trace to identify redundant operations, circular reasoning, and wasted computation, scoring the overall efficiency of the agent's action sequence.
Description
The StepEfficiencyMetric evaluates the efficiency of the agent's execution path. It examines the trace of steps taken by the agent and uses an LLM-as-judge approach to assess whether each step was necessary and whether the overall execution was reasonably efficient.
Key capabilities:
- Trace-based analysis -- examines the full sequence of agent steps including LLM calls, tool invocations, and reasoning steps.
- Redundancy detection -- identifies unnecessary or repeated operations in the execution trace.
- Efficiency scoring -- produces a continuous score reflecting the proportion of useful steps relative to total steps.
- Reason generation -- provides human-readable explanations of inefficiencies found in the trace.
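The "proportion of useful steps relative to total steps" idea can be illustrated with a small standalone sketch. This is not the library's implementation: the real metric relies on an LLM judge to decide which steps were necessary, whereas the `Step` class, its `necessary` flag, and `efficiency_ratio` below are hypothetical names introduced only to show the arithmetic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One entry in an agent trace (hypothetical structure for illustration)."""
    name: str
    necessary: bool  # assumed label; in the real metric an LLM judge makes this call


def efficiency_ratio(steps: List[Step]) -> float:
    """Return the fraction of steps marked necessary.

    An empty trace is treated as fully efficient (an assumption of this sketch).
    """
    if not steps:
        return 1.0
    useful = sum(1 for step in steps if step.necessary)
    return useful / len(steps)


# A four-step trace in which one tool call is a redundant repeat.
trace = [
    Step("search_flights", necessary=True),
    Step("search_flights", necessary=False),  # repeated operation
    Step("search_hotels", necessary=True),
    Step("summarize_plan", necessary=True),
]
ratio = efficiency_ratio(trace)  # 3 useful steps out of 4
```

Under these assumptions, a trace with one redundant call out of four steps lands at 0.75, which would pass the default threshold of 0.5.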
Usage
Import and instantiate for agent evaluation:
from deepeval.metrics import StepEfficiencyMetric
Code Reference
Source Location
- Repository: confident-ai/deepeval
- File: deepeval/metrics/step_efficiency/step_efficiency.py (lines 22--225)
Signature
class StepEfficiencyMetric(BaseMetric):
    def __init__(
        self,
        threshold: float = 0.5,
        model: Optional[str] = None,
        include_reason: bool = True,
        async_mode: bool = True,
        strict_mode: bool = False,
        verbose_mode: bool = False,
    ):
        ...
Import
from deepeval.metrics import StepEfficiencyMetric
Parent Class
BaseMetric
I/O Contract
Inputs (Constructor Parameters)
| Name | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.5 | Minimum score (0--1) for the evaluation to pass. |
| model | Optional[str] | None | LLM model to use as the evaluation judge. Falls back to default if not specified. |
| include_reason | bool | True | Whether to generate a human-readable reason for the score. |
| async_mode | bool | True | Whether to run evaluation asynchronously. |
| strict_mode | bool | False | When enabled, scores are binarized to 0 or 1 based on the threshold. |
| verbose_mode | bool | False | When enabled, prints detailed evaluation information during execution. |
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | A value between 0 and 1 indicating the efficiency of the agent's execution path. |
| reason | Optional[str] | Human-readable explanation identifying inefficiencies (when include_reason=True). |
| success | bool | Whether the score meets or exceeds the threshold. |
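How `threshold`, `strict_mode`, and `include_reason` shape the three outputs can be sketched as follows. This is a minimal illustration built only from the table descriptions above; `MetricResult` and `finalize` are hypothetical names, and the library's internal logic may differ in detail.

```python
from typing import NamedTuple, Optional


class MetricResult(NamedTuple):
    """Mirrors the documented outputs: score, reason, success."""
    score: float
    reason: Optional[str]
    success: bool


def finalize(
    raw_score: float,
    reason: Optional[str] = None,
    threshold: float = 0.5,
    strict_mode: bool = False,
    include_reason: bool = True,
) -> MetricResult:
    """Apply the documented post-processing to a raw judge score (illustrative only)."""
    score = raw_score
    if strict_mode:
        # Binarize to 0 or 1 based on the threshold, as the table describes.
        score = 1.0 if raw_score >= threshold else 0.0
    return MetricResult(
        score=score,
        reason=reason if include_reason else None,
        success=score >= threshold,  # "meets or exceeds the threshold"
    )


lenient = finalize(0.6, reason="One repeated tool call.")
strict_fail = finalize(0.6, threshold=0.7, strict_mode=True)
strict_pass = finalize(0.8, threshold=0.7, strict_mode=True)
```

In this sketch the same raw score of 0.6 passes at the default threshold but collapses to 0 once `strict_mode` is combined with a threshold of 0.7.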
Usage Examples
Example 1: Basic Step Efficiency Evaluation
Create a metric with default settings.
from deepeval.metrics import StepEfficiencyMetric
metric = StepEfficiencyMetric(threshold=0.5)
Example 2: Custom Model and Threshold
Use a specific judge model with a higher efficiency threshold.
from deepeval.metrics import StepEfficiencyMetric
metric = StepEfficiencyMetric(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True,
)
- The model="gpt-4o" parameter specifies the LLM used as the evaluation judge.
- The threshold=0.7 setting requires an efficiency score of at least 0.7 for the evaluation to pass.
Example 3: Combined with Framework Instrumentation
Use with a LangChain callback handler for automatic efficiency evaluation.
from deepeval.metrics import StepEfficiencyMetric, TaskCompletionMetric
from deepeval.integrations.langchain import CallbackHandler
handler = CallbackHandler(
    metrics=[TaskCompletionMetric(), StepEfficiencyMetric()],
    name="my-agent",
)
# `agent` is assumed to be a LangChain agent or runnable created elsewhere.
agent.invoke({"input": "Plan a trip to Tokyo"}, config={"callbacks": [handler]})
- Task completion and step efficiency are evaluated on the same run, so the results show both whether the agent finished the task and how directly it got there.