Implementation:Microsoft Autogen Studio Eval Datamodel

From Leeroopedia
Sources: python/packages/autogen-studio/autogenstudio/datamodel/eval.py
Domains: Evaluation, Data_Modeling, Agent_Systems
Last Updated: 2026-02-11

Overview

Description

The Studio Eval Datamodel module defines the core Pydantic data models used throughout AutoGen Studio's evaluation system. This module provides structured representations for evaluation tasks, results, scoring mechanisms, and criteria. It serves as the foundational data layer for AutoGen Studio's evaluation capabilities, enabling type-safe interactions between evaluation runners and judges.

The module includes seven classes: six Pydantic models and one enum.

  • EvalTask - Defines a task to be evaluated with input, metadata, and optional expected outputs
  • EvalRunResult - Captures the result of running an evaluation task including status, timing, and errors
  • EvalDimensionScore - Represents a score for a single evaluation dimension with reasoning
  • EvalScore - Composite scoring structure aggregating multiple dimension scores
  • EvalJudgeCriteria - Specifies criteria for judging evaluation results along specific dimensions
  • EvalRunStatus - Enum defining the possible evaluation run states (pending, running, completed, failed, canceled)
  • EvalResult - Tracks an evaluation run by task ID, run status, and start and end times

Usage

These data models are used throughout the AutoGen Studio evaluation system. Evaluation runners create EvalRunResult objects from EvalTask inputs, while judges consume these results along with EvalJudgeCriteria to produce EvalScore outputs. The models leverage Pydantic for validation and serialization, ensuring data consistency across the evaluation pipeline.
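
The validation and serialization behavior mentioned above can be sketched with a local stand-in model. JudgeCriteriaSketch is a hypothetical class that only mirrors the documented EvalJudgeCriteria fields, and the sketch assumes Pydantic v2 (model_dump_json and model_validate_json); it is not the module's own code.

```python
from typing import Any, Dict

from pydantic import BaseModel, ValidationError


class JudgeCriteriaSketch(BaseModel):
    """Hypothetical stand-in mirroring the documented EvalJudgeCriteria fields."""

    dimension: str
    prompt: str
    max_value: float = 10.0
    min_value: float = 0.0
    metadata: Dict[str, Any] = {}


# Validation: a missing required field is rejected at construction time.
try:
    JudgeCriteriaSketch(dimension="accuracy")  # "prompt" is missing
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True

# Serialization: dump to JSON, then load it back into an equal model.
criteria = JudgeCriteriaSketch(dimension="accuracy", prompt="Rate factual accuracy.")
payload = criteria.model_dump_json()
restored = JudgeCriteriaSketch.model_validate_json(payload)
```

The same round trip should apply to the real models, since they are ordinary BaseModel subclasses according to the signatures on this page.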

Code Reference

Source Location

python/packages/autogen-studio/autogenstudio/datamodel/eval.py

Signature

# Core Pydantic Models
class EvalTask(BaseModel):
    task_id: UUID | str = Field(default_factory=uuid4)
    input: str | Sequence[str | Image]
    name: str = ""
    description: str = ""
    expected_outputs: Optional[List[Any]] = None
    metadata: Dict[str, Any] = {}

class EvalRunResult(BaseModel):
    result: TaskResult | None = None
    status: bool = False
    start_time: Optional[datetime] = Field(default=datetime.now())
    end_time: Optional[datetime] = None
    error: Optional[str] = None

class EvalDimensionScore(BaseModel):
    dimension: str
    score: float
    reason: str
    max_value: float
    min_value: float

class EvalScore(BaseModel):
    overall_score: Optional[float] = None
    dimension_scores: List[EvalDimensionScore] = []
    reason: Optional[str] = None
    max_value: float = 10.0
    min_value: float = 0.0
    metadata: Dict[str, Any] = {}

class EvalJudgeCriteria(BaseModel):
    dimension: str
    prompt: str
    max_value: float = 10.0
    min_value: float = 0.0
    metadata: Dict[str, Any] = {}

class EvalRunStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

class EvalResult(BaseModel):
    task_id: UUID | str
    status: EvalRunStatus = EvalRunStatus.PENDING
    start_time: Optional[datetime] = Field(default=datetime.now())
    end_time: Optional[datetime] = None
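
One detail of the signature above is worth a quick check: Field(default=datetime.now()) calls datetime.now() once, when the class body is evaluated, so every instance that omits start_time receives that same timestamp. The sketch below contrasts it with default_factory. FrozenDefault and FreshDefault are hypothetical classes written for this comparison, and the sketch assumes the signature shown here matches the source verbatim.

```python
import time
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field


class FrozenDefault(BaseModel):
    # Same pattern as the signature above: now() runs once, at class definition.
    start_time: Optional[datetime] = Field(default=datetime.now())


class FreshDefault(BaseModel):
    # Alternative pattern: now() runs each time a model is instantiated.
    start_time: Optional[datetime] = Field(default_factory=datetime.now)


first_frozen = FrozenDefault()
first_fresh = FreshDefault()
time.sleep(0.05)  # let the wall clock advance between instances
second_frozen = FrozenDefault()
second_fresh = FreshDefault()

# Both "frozen" instances share the class-definition timestamp.
same_frozen = first_frozen.start_time == second_frozen.start_time
# The factory-based default moves forward with the clock.
fresh_advanced = second_fresh.start_time > first_fresh.start_time
```

When accurate timing matters, passing start_time explicitly (as the usage examples on this page do) avoids depending on the default.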

Import

from autogenstudio.datamodel.eval import (
    EvalTask,
    EvalRunResult,
    EvalDimensionScore,
    EvalScore,
    EvalJudgeCriteria,
    EvalRunStatus,
    EvalResult
)

I/O Contract

Inputs

Model               Field             Type                         Description
EvalTask            task_id           UUID | str                   Unique identifier for the task (auto-generated UUID if not provided)
EvalTask            input             str | Sequence[str | Image]  Task input: either a single text string or a sequence of strings and images
EvalTask            name              str                          Human-readable name for the task (default: "")
EvalTask            description       str                          Detailed description of the task (default: "")
EvalTask            expected_outputs  Optional[List[Any]]          Expected outputs for validation purposes (default: None)
EvalTask            metadata          Dict[str, Any]               Additional metadata about the task
EvalJudgeCriteria   dimension         str                          Name of the evaluation dimension (e.g., "relevance", "accuracy")
EvalJudgeCriteria   prompt            str                          Prompt text for judges to evaluate this dimension
EvalJudgeCriteria   max_value         float                        Maximum score value (default: 10.0)
EvalJudgeCriteria   min_value         float                        Minimum score value (default: 0.0)
EvalJudgeCriteria   metadata          Dict[str, Any]               Additional metadata for the criteria

Outputs

Model               Field             Type                         Description
EvalRunResult       result            TaskResult | None            The task result from autogen_agentchat (default: None)
EvalRunResult       status            bool                         Whether the run succeeded (default: False)
EvalRunResult       start_time        Optional[datetime]           When the evaluation started (defaults to a datetime.now() value)
EvalRunResult       end_time          Optional[datetime]           When the evaluation completed
EvalRunResult       error             Optional[str]                Error message if status is False
EvalDimensionScore  dimension         str                          Name of the evaluated dimension
EvalDimensionScore  score             float                        Numerical score for this dimension
EvalDimensionScore  reason            str                          Textual explanation for the score
EvalDimensionScore  max_value         float                        Maximum possible score
EvalDimensionScore  min_value         float                        Minimum possible score
EvalScore           overall_score     Optional[float]              Aggregate score across all dimensions
EvalScore           dimension_scores  List[EvalDimensionScore]     Individual scores for each evaluation dimension
EvalScore           reason            Optional[str]                Overall reasoning for the composite score
EvalScore           max_value         float                        Maximum possible overall score (default: 10.0)
EvalScore           min_value         float                        Minimum possible overall score (default: 0.0)
EvalScore           metadata          Dict[str, Any]               Additional metadata about the scoring
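
Because each dimension score carries its own min_value and max_value, dimensions on different scales can be normalized before being combined. The sketch below shows one possible way to do that. DimensionScoreSketch and ScoreSketch are hypothetical stand-ins for the documented models, and the unweighted mean in aggregate is an assumption made for illustration; this page does not say how AutoGen Studio computes overall_score.

```python
from typing import List, Optional

from pydantic import BaseModel


class DimensionScoreSketch(BaseModel):
    """Hypothetical stand-in for EvalDimensionScore."""

    dimension: str
    score: float
    reason: str
    max_value: float
    min_value: float


class ScoreSketch(BaseModel):
    """Hypothetical stand-in for EvalScore (metadata and reason omitted)."""

    overall_score: Optional[float] = None
    dimension_scores: List[DimensionScoreSketch] = []
    max_value: float = 10.0
    min_value: float = 0.0


def normalize(item: DimensionScoreSketch) -> float:
    """Map a dimension score onto [0, 1] using its own bounds."""
    span = item.max_value - item.min_value
    if span <= 0:
        raise ValueError(f"invalid bounds for {item.dimension!r}")
    return (item.score - item.min_value) / span


def aggregate(items: List[DimensionScoreSketch]) -> ScoreSketch:
    """Assumed aggregation: unweighted mean, rescaled to the composite range."""
    result = ScoreSketch(dimension_scores=items)
    if items:
        mean = sum(normalize(i) for i in items) / len(items)
        result.overall_score = result.min_value + mean * (result.max_value - result.min_value)
    return result


scores = [
    DimensionScoreSketch(dimension="relevance", score=4.0, reason="on topic", max_value=5.0, min_value=0.0),
    DimensionScoreSketch(dimension="accuracy", score=10.0, reason="correct", max_value=10.0, min_value=0.0),
]
combined = aggregate(scores)
```

With no dimension scores, overall_score stays None, which matches the Optional[float] type in the table above.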

Usage Examples

Creating an Evaluation Task

from autogenstudio.datamodel.eval import EvalTask

# Simple text task
task = EvalTask(
    name="French Capital Query",
    description="Test the agent's knowledge of geography",
    input="What is the capital of France?",
    expected_outputs=["Paris"],
    metadata={"category": "geography", "difficulty": "easy"}
)

# Multi-modal task with images
from autogen_core import Image

visual_task = EvalTask(
    name="Image Analysis",
    input=[
        "Describe what you see in this image:",
        Image.from_file("path/to/image.png")
    ],
    metadata={"task_type": "vision"}
)
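
The task_id default can be sketched with a text-only stand-in. TaskSketch is a hypothetical class that mirrors the documented EvalTask fields but leaves out Image inputs so that it runs without autogen_core. It shows that omitted IDs are generated per instance, that a caller-supplied string ID is kept as given, and that the {} metadata default is not shared between instances, because Pydantic copies mutable defaults.

```python
from typing import Any, Dict, List, Optional, Sequence, Union
from uuid import UUID, uuid4

from pydantic import BaseModel, Field


class TaskSketch(BaseModel):
    """Hypothetical text-only stand-in for EvalTask."""

    task_id: Union[UUID, str] = Field(default_factory=uuid4)
    input: Union[str, Sequence[str]]
    name: str = ""
    expected_outputs: Optional[List[Any]] = None
    metadata: Dict[str, Any] = {}


auto_a = TaskSketch(input="What is the capital of France?")
auto_b = TaskSketch(input="What is the capital of Spain?")
named = TaskSketch(task_id="geo-001", input="What is the capital of Italy?")

# Mutating one instance's metadata leaves the other instance untouched.
auto_a.metadata["category"] = "geography"
```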

Defining Judge Criteria

from autogenstudio.datamodel.eval import EvalJudgeCriteria

criteria = [
    EvalJudgeCriteria(
        dimension="relevance",
        prompt="Evaluate how relevant the response is to the query.",
        max_value=10.0,
        min_value=0.0
    ),
    EvalJudgeCriteria(
        dimension="accuracy",
        prompt="Evaluate the factual accuracy of the response.",
        max_value=10.0,
        min_value=0.0
    ),
    EvalJudgeCriteria(
        dimension="completeness",
        prompt="Evaluate whether the response fully addresses the query.",
        max_value=10.0,
        min_value=0.0
    )
]
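
How a judge turns criteria into an actual prompt is not covered on this page. As a rough illustration only, a list of criteria could be rendered into a single instruction block as follows. render_judge_prompt and CriteriaSketch are hypothetical names invented for this sketch, and the prompt layout is an assumption, not the format AutoGen Studio's judges use.

```python
from typing import List

from pydantic import BaseModel


class CriteriaSketch(BaseModel):
    """Hypothetical stand-in for EvalJudgeCriteria (metadata omitted)."""

    dimension: str
    prompt: str
    max_value: float = 10.0
    min_value: float = 0.0


def render_judge_prompt(criteria: List[CriteriaSketch], response: str) -> str:
    """Hypothetical helper: one instruction line per dimension, then the response."""
    lines = ["Score the response on each dimension below."]
    for item in criteria:
        lines.append(
            f"- {item.dimension} ({item.min_value:g} to {item.max_value:g}): {item.prompt}"
        )
    lines.append("")
    lines.append(f"Response: {response}")
    return "\n".join(lines)


rubric = [
    CriteriaSketch(dimension="relevance", prompt="Evaluate how relevant the response is to the query."),
    CriteriaSketch(dimension="accuracy", prompt="Evaluate the factual accuracy of the response."),
]
judge_prompt = render_judge_prompt(rubric, "Paris")
```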

Working with Evaluation Results

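from autogenstudio.datamodel.eval import EvalRunResult, EvalScore, EvalDimensionScore
from autogen_agentchat.base import TaskResult
from autogen_agentchat.messages import TextMessage
from datetime import datetime

# In practice task_result comes from running a team or agent; a minimal
# hand-built TaskResult stands in for it here
task_result = TaskResult(
    messages=[TextMessage(content="Paris", source="assistant")]
)

# Creating a result
result = EvalRunResult(
    status=True,
    start_time=datetime.now(),
    end_time=datetime.now(),
    result=task_result  # TaskResult from autogen_agentchat
)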

# Creating dimension scores
dimension_scores = [
    EvalDimensionScore(
        dimension="relevance",
        score=9.0,
        reason="Response directly addresses the query",
        max_value=10.0,
        min_value=0.0
    ),
    EvalDimensionScore(
        dimension="accuracy",
        score=10.0,
        reason="Factually correct response",
        max_value=10.0,
        min_value=0.0
    )
]

# Composite score
eval_score = EvalScore(
    overall_score=9.5,
    dimension_scores=dimension_scores,
    reason="High quality response with accurate and relevant information"
)

Using EvalRunStatus

from datetime import datetime

from autogenstudio.datamodel.eval import EvalRunStatus, EvalResult

# Track evaluation progress
eval_result = EvalResult(
    task_id="task-123",
    status=EvalRunStatus.PENDING,
    start_time=datetime.now()
)

# Update status as evaluation progresses
eval_result.status = EvalRunStatus.RUNNING

# Mark as completed
eval_result.status = EvalRunStatus.COMPLETED
eval_result.end_time = datetime.now()
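
When tracking status this way, it can help to know whether a run has reached a final state and how long it took. The sketch below uses only the standard library. RunStatusSketch copies the documented EvalRunStatus values, while is_terminal, elapsed, and the choice of which states count as final are assumptions made for illustration rather than part of the module.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class RunStatusSketch(str, Enum):
    """Local copy of the documented EvalRunStatus values."""

    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


# Assumption: these three states are treated as final.
TERMINAL = {RunStatusSketch.COMPLETED, RunStatusSketch.FAILED, RunStatusSketch.CANCELED}


def is_terminal(status: RunStatusSketch) -> bool:
    """Return True once a run can no longer change state (under the assumption above)."""
    return status in TERMINAL


def elapsed(start: Optional[datetime], end: Optional[datetime]) -> Optional[timedelta]:
    """Return the run duration, or None while either timestamp is missing."""
    if start is None or end is None:
        return None
    return end - start


started = datetime(2026, 1, 1, 12, 0, 0)
finished = started + timedelta(seconds=42)
duration = elapsed(started, finished)
```

Because the enum subclasses str, a stored value such as "completed" converts back with RunStatusSketch("completed") and compares equal to the plain string.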
