Implementation:Microsoft Autogen Studio Eval Datamodel
| Sources | python/packages/autogen-studio/autogenstudio/datamodel/eval.py |
|---|---|
| Domains | Evaluation, Data_Modeling, Agent_Systems |
| Last Updated | 2026-02-11 |
Overview
Description
The Studio Eval Datamodel module defines the core Pydantic data models used throughout AutoGen Studio's evaluation system. This module provides structured representations for evaluation tasks, results, scoring mechanisms, and criteria. It serves as the foundational data layer for AutoGen Studio's evaluation capabilities, enabling type-safe interactions between evaluation runners and judges.
The module defines six Pydantic models and one enum:
- EvalTask - Defines a task to be evaluated with input, metadata, and optional expected outputs
- EvalRunResult - Captures the result of running an evaluation task including status, timing, and errors
- EvalDimensionScore - Represents a score for a single evaluation dimension with reasoning
- EvalScore - Composite scoring structure aggregating multiple dimension scores
- EvalJudgeCriteria - Specifies criteria for judging evaluation results along specific dimensions
- EvalRunStatus - Enum defining possible evaluation run states
- EvalResult - Tracks an evaluation run for a task by task ID, with an EvalRunStatus and start/end times
Usage
These data models are used throughout the AutoGen Studio evaluation system. Evaluation runners create EvalRunResult objects from EvalTask inputs, while judges consume these results along with EvalJudgeCriteria to produce EvalScore outputs. The models leverage Pydantic for validation and serialization, ensuring data consistency across the evaluation pipeline.
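The judge side of this flow, where per-dimension scores are combined into one composite score, can be sketched as follows. This is a minimal illustration, not AutoGen Studio's implementation: the dataclasses are simplified stdlib stand-ins for `EvalDimensionScore` and `EvalScore` so the snippet runs without the package installed, and `aggregate_scores` is a hypothetical helper that takes an unweighted mean, which is only one possible aggregation rule.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class DimensionScore:
    """Stand-in mirroring the fields of EvalDimensionScore."""
    dimension: str
    score: float
    reason: str
    max_value: float
    min_value: float


@dataclass
class Score:
    """Stand-in mirroring a subset of EvalScore's fields."""
    overall_score: Optional[float] = None
    dimension_scores: List[DimensionScore] = field(default_factory=list)
    reason: Optional[str] = None


def aggregate_scores(dimension_scores: List[DimensionScore]) -> Score:
    """Hypothetical helper: combine per-dimension scores with an unweighted mean."""
    if not dimension_scores:
        # No dimensions were judged, so leave overall_score unset.
        return Score()
    overall = mean(d.score for d in dimension_scores)
    return Score(overall_score=overall, dimension_scores=list(dimension_scores))


scores = [
    DimensionScore("relevance", 9.0, "Directly addresses the query", 10.0, 0.0),
    DimensionScore("accuracy", 10.0, "Factually correct", 10.0, 0.0),
]
combined = aggregate_scores(scores)
print(combined.overall_score)  # 9.5
```

A weighted mean or a minimum over dimensions would fit the same data shapes; the source does not say which rule the judges apply.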
Code Reference
Source Location
python/packages/autogen-studio/autogenstudio/datamodel/eval.py
Signature
# Core Pydantic Models
class EvalTask(BaseModel):
    task_id: UUID | str = Field(default_factory=uuid4)
    input: str | Sequence[str | Image]
    name: str = ""
    description: str = ""
    expected_outputs: Optional[List[Any]] = None
    metadata: Dict[str, Any] = {}

class EvalRunResult(BaseModel):
    result: TaskResult | None = None
    status: bool = False
    start_time: Optional[datetime] = Field(default=datetime.now())
    end_time: Optional[datetime] = None
    error: Optional[str] = None

class EvalDimensionScore(BaseModel):
    dimension: str
    score: float
    reason: str
    max_value: float
    min_value: float

class EvalScore(BaseModel):
    overall_score: Optional[float] = None
    dimension_scores: List[EvalDimensionScore] = []
    reason: Optional[str] = None
    max_value: float = 10.0
    min_value: float = 0.0
    metadata: Dict[str, Any] = {}

class EvalJudgeCriteria(BaseModel):
    dimension: str
    prompt: str
    max_value: float = 10.0
    min_value: float = 0.0
    metadata: Dict[str, Any] = {}

class EvalRunStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

class EvalResult(BaseModel):
    task_id: UUID | str
    status: EvalRunStatus = EvalRunStatus.PENDING
    start_time: Optional[datetime] = Field(default=datetime.now())
    end_time: Optional[datetime] = None
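One detail of the signature is worth noting. In `Field(default=datetime.now())`, the call `datetime.now()` is an ordinary expression that Python evaluates once, when the class body runs, so every instance that omits `start_time` receives that same timestamp. The sketch below demonstrates this evaluation rule with stdlib dataclasses standing in for the Pydantic models (the class names are hypothetical). Whether the module relies on the import-time value is not stated in the source.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class EagerDefault:
    # datetime.now() runs once, while this class body is executed.
    start_time: datetime = field(default=datetime.now())


@dataclass
class FactoryDefault:
    # datetime.now is called again for every new instance.
    start_time: datetime = field(default_factory=datetime.now)


first_eager = EagerDefault()
first_factory = FactoryDefault()
time.sleep(0.05)  # make sure the clock advances between instances
second_eager = EagerDefault()
second_factory = FactoryDefault()

print(first_eager.start_time == second_eager.start_time)     # True
print(first_factory.start_time < second_factory.start_time)  # True
```

Pydantic's `Field` also accepts `default_factory`, as the `task_id` field above already shows. Callers who need an accurate start time can pass `start_time` explicitly, as the usage examples on this page do.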
Import
from autogenstudio.datamodel.eval import (
    EvalTask,
    EvalRunResult,
    EvalDimensionScore,
    EvalScore,
    EvalJudgeCriteria,
    EvalRunStatus,
    EvalResult,
)
I/O Contract
Inputs
| Model | Field | Type | Description |
|---|---|---|---|
| EvalTask | task_id | UUID \| str | Unique identifier for the task (generated with uuid4 if not provided) |
| EvalTask | input | str \| Sequence[str \| Image] | Task input: a single text string or a sequence of strings and images |
| EvalTask | name | str | Human-readable name for the task (default: "") |
| EvalTask | description | str | Detailed description of the task (default: "") |
| EvalTask | expected_outputs | Optional[List[Any]] | Expected outputs for validation purposes |
| EvalTask | metadata | Dict[str, Any] | Additional metadata about the task |
| EvalJudgeCriteria | dimension | str | Name of the evaluation dimension (e.g., "relevance", "accuracy") |
| EvalJudgeCriteria | prompt | str | Prompt text for judges to evaluate this dimension |
| EvalJudgeCriteria | max_value | float | Maximum score value (default: 10.0) |
| EvalJudgeCriteria | min_value | float | Minimum score value (default: 0.0) |
| EvalJudgeCriteria | metadata | Dict[str, Any] | Additional metadata for the criteria |
Outputs
| Model | Field | Type | Description |
|---|---|---|---|
| EvalRunResult | result | TaskResult \| None | The actual task result from autogen_agentchat |
| EvalRunResult | status | bool | Success/failure status of the run (default: False) |
| EvalRunResult | start_time | Optional[datetime] | When the evaluation started (defaults to a `datetime.now()` value if not provided) |
| EvalRunResult | end_time | Optional[datetime] | When the evaluation completed |
| EvalRunResult | error | Optional[str] | Error message if status is False |
| EvalDimensionScore | dimension | str | Name of the evaluated dimension |
| EvalDimensionScore | score | float | Numerical score for this dimension |
| EvalDimensionScore | reason | str | Textual explanation for the score |
| EvalDimensionScore | max_value | float | Maximum possible score |
| EvalDimensionScore | min_value | float | Minimum possible score |
| EvalScore | overall_score | Optional[float] | Aggregate score across all dimensions |
| EvalScore | dimension_scores | List[EvalDimensionScore] | Individual scores for each evaluation dimension |
| EvalScore | reason | Optional[str] | Overall reasoning for the composite score |
| EvalScore | max_value | float | Maximum possible overall score (default: 10.0) |
| EvalScore | min_value | float | Minimum possible overall score (default: 0.0) |
| EvalScore | metadata | Dict[str, Any] | Additional metadata about the scoring |
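Because every score carries its own `min_value` and `max_value`, scores reported on different scales can be compared after rescaling them to the range [0, 1]. The helper below, `normalize_score`, is hypothetical and not part of the module. It takes plain floats so it runs standalone, and it would accept the `score`, `min_value`, and `max_value` fields of an `EvalDimensionScore`.

```python
def normalize_score(score: float, min_value: float, max_value: float) -> float:
    """Hypothetical helper: rescale linearly so min_value -> 0.0 and max_value -> 1.0."""
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    if not min_value <= score <= max_value:
        raise ValueError(f"score {score} is outside [{min_value}, {max_value}]")
    return (score - min_value) / (max_value - min_value)


print(normalize_score(9.0, 0.0, 10.0))  # 0.9
print(normalize_score(3.0, 1.0, 5.0))   # 0.5
```

The signature shown on this page declares no range validation, so checks like these are assumed here to be the caller's responsibility.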
Usage Examples
Creating an Evaluation Task
from autogenstudio.datamodel.eval import EvalTask

# Simple text task
task = EvalTask(
    name="French Capital Query",
    description="Test the agent's knowledge of geography",
    input="What is the capital of France?",
    expected_outputs=["Paris"],
    metadata={"category": "geography", "difficulty": "easy"},
)

# Multi-modal task with images
from autogen_core import Image

visual_task = EvalTask(
    name="Image Analysis",
    input=[
        "Describe what you see in this image:",
        Image.from_file("path/to/image.png"),
    ],
    metadata={"task_type": "vision"},
)
Defining Judge Criteria
from autogenstudio.datamodel.eval import EvalJudgeCriteria

criteria = [
    EvalJudgeCriteria(
        dimension="relevance",
        prompt="Evaluate how relevant the response is to the query.",
        max_value=10.0,
        min_value=0.0,
    ),
    EvalJudgeCriteria(
        dimension="accuracy",
        prompt="Evaluate the factual accuracy of the response.",
        max_value=10.0,
        min_value=0.0,
    ),
    EvalJudgeCriteria(
        dimension="completeness",
        prompt="Evaluate whether the response fully addresses the query.",
        max_value=10.0,
        min_value=0.0,
    ),
]
Working with Evaluation Results
from autogenstudio.datamodel.eval import EvalRunResult, EvalScore, EvalDimensionScore
from datetime import datetime

# Creating a result
# task_result is assumed to be a TaskResult returned by an earlier
# autogen_agentchat run; it is not defined in this snippet.
result = EvalRunResult(
    status=True,
    start_time=datetime.now(),
    end_time=datetime.now(),
    result=task_result,
)

# Creating dimension scores
dimension_scores = [
    EvalDimensionScore(
        dimension="relevance",
        score=9.0,
        reason="Response directly addresses the query",
        max_value=10.0,
        min_value=0.0,
    ),
    EvalDimensionScore(
        dimension="accuracy",
        score=10.0,
        reason="Factually correct response",
        max_value=10.0,
        min_value=0.0,
    ),
]

# Composite score
eval_score = EvalScore(
    overall_score=9.5,
    dimension_scores=dimension_scores,
    reason="High quality response with accurate and relevant information",
)
Using EvalRunStatus
from datetime import datetime

from autogenstudio.datamodel.eval import EvalRunStatus, EvalResult

# Track evaluation progress
eval_result = EvalResult(
    task_id="task-123",
    status=EvalRunStatus.PENDING,
    start_time=datetime.now(),
)

# Update status as evaluation progresses
eval_result.status = EvalRunStatus.RUNNING

# Mark as completed
eval_result.status = EvalRunStatus.COMPLETED
eval_result.end_time = datetime.now()
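When polling a run, it can help to know whether a status is final. The sketch below re-declares `EvalRunStatus` exactly as in the signature so that it runs standalone. The `TERMINAL_STATES` set and the `is_terminal` helper are hypothetical: the source does not define which transitions are allowed, so treating COMPLETED, FAILED, and CANCELED as final is an assumption.

```python
from enum import Enum


class EvalRunStatus(str, Enum):
    # Re-declared from the signature on this page so the snippet runs standalone.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


# Assumption: these three states are final; the source does not define transitions.
TERMINAL_STATES = frozenset(
    {EvalRunStatus.COMPLETED, EvalRunStatus.FAILED, EvalRunStatus.CANCELED}
)


def is_terminal(status: EvalRunStatus) -> bool:
    """Hypothetical helper: True once a run is assumed unable to change state."""
    return status in TERMINAL_STATES


print(is_terminal(EvalRunStatus.RUNNING))               # False
print(is_terminal(EvalRunStatus.COMPLETED))             # True
print(EvalRunStatus("failed") is EvalRunStatus.FAILED)  # True
print(EvalRunStatus.PENDING == "pending")               # True
```

Because the enum subclasses `str`, members compare equal to their string values and can be rebuilt from stored strings, which is what the last two lines show.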
Related Pages
- Studio Eval Judges - Judge implementations that consume these data models
- Studio Eval Runners - Runner implementations that produce EvalRunResult
- Studio Datamodel Pydantic Types - Additional Pydantic models for AutoGen Studio
- Evaluation Domain - All evaluation-related implementations
- Data Modeling Domain - Data structure implementations