Implementation:Togethercomputer Together python Evaluation Resource

From Leeroopedia
Knowledge Sources
Domains Evaluation, LLM
Last Updated 2026-02-15 16:00 GMT

Overview

Concrete tool, provided by the Together Python SDK, for creating and managing LLM evaluation jobs on the Together AI platform.

Description

The Evaluation class provides API methods for running LLM-as-a-judge evaluations. It supports three evaluation types: classify (categorical labeling), score (numeric scoring with thresholds), and compare (pairwise model comparison). Each type uses a configurable judge model to evaluate model outputs against input data. Both synchronous (Evaluation) and asynchronous (AsyncEvaluation) variants are provided.

Usage

Import this class when you need to programmatically evaluate LLM outputs using judge models, compare two models head-to-head, or classify/score model responses against a dataset.

Code Reference

Source Location

Signature

class Evaluation:
    def __init__(self, client: TogetherClient) -> None: ...

    def create(
        self,
        type: str,
        judge_model: str,
        judge_model_source: str,
        judge_system_template: str,
        input_data_file_path: str,
        judge_external_api_token: Optional[str] = None,
        judge_external_base_url: Optional[str] = None,
        labels: Optional[List[str]] = None,
        pass_labels: Optional[List[str]] = None,
        min_score: Optional[float] = None,
        max_score: Optional[float] = None,
        pass_threshold: Optional[float] = None,
        model_a: Optional[Union[str, Dict[str, Any]]] = None,
        model_b: Optional[Union[str, Dict[str, Any]]] = None,
        model_to_evaluate: Optional[Union[str, Dict[str, Any]]] = None,
    ) -> EvaluationCreateResponse: ...

    def list(
        self,
        status: Optional[str] = None,
        limit: Optional[int] = None,
    ) -> List[EvaluationJob]: ...

    def retrieve(self, evaluation_id: str) -> EvaluationJob: ...
    def status(self, evaluation_id: str) -> EvaluationStatusResponse: ...

Import

from together import Together

client = Together()
# Access via client.evaluation

I/O Contract

Inputs

Name | Type | Required | Description
type | str | Yes | Evaluation type: "classify", "score", or "compare"
judge_model | str | Yes | Name or URL of the judge model
judge_model_source | str | Yes | Source: "serverless", "dedicated", or "external"
judge_system_template | str | Yes | System prompt template for the judge
input_data_file_path | str | Yes | Path to input data file on the platform
labels | List[str] | Yes (classify) | Classification label options
pass_labels | List[str] | Yes (classify) | Labels that count as passing
min_score | float | Yes (score) | Minimum score boundary
max_score | float | Yes (score) | Maximum score boundary
pass_threshold | float | Yes (score) | Score threshold for passing
model_a | Union[str, Dict] | Yes (compare) | First model for comparison
model_b | Union[str, Dict] | Yes (compare) | Second model for comparison
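
The type-specific requirements in the table above can be checked on the client side before a job is submitted. The following is a minimal sketch, not part of the SDK: `REQUIRED_BY_TYPE` and `missing_arguments` are hypothetical names introduced here, and the mapping is transcribed from the "Required" column only.

```python
from typing import Any, Dict, List

# Type-specific required arguments, transcribed from the Inputs table.
# Hypothetical helper data; the SDK may perform its own validation.
REQUIRED_BY_TYPE: Dict[str, List[str]] = {
    "classify": ["labels", "pass_labels"],
    "score": ["min_score", "max_score", "pass_threshold"],
    "compare": ["model_a", "model_b"],
}


def missing_arguments(eval_type: str, kwargs: Dict[str, Any]) -> List[str]:
    """Return the type-specific arguments that are absent or None."""
    if eval_type not in REQUIRED_BY_TYPE:
        raise ValueError(f"Unknown evaluation type: {eval_type!r}")
    return [name for name in REQUIRED_BY_TYPE[eval_type] if kwargs.get(name) is None]
```

Calling `missing_arguments("score", {"min_score": 1.0, "max_score": 10.0})` reports that `pass_threshold` is still needed, which gives a clearer error than a failed API request.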

Outputs

Name | Type | Description
create() returns | EvaluationCreateResponse | Contains workflow_id and initial status
list() returns | List[EvaluationJob] | List of evaluation jobs with status and parameters
retrieve() returns | EvaluationJob | Full evaluation job details
status() returns | EvaluationStatusResponse | Current status and results of an evaluation

Usage Examples

Classify Evaluation

from together import Together

client = Together()

# Create a classify evaluation
response = client.evaluation.create(
    type="classify",
    judge_model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    judge_model_source="serverless",
    judge_system_template="Classify the following response as helpful or unhelpful.",
    input_data_file_path="file-abc123",
    labels=["helpful", "unhelpful"],
    pass_labels=["helpful"],
    model_to_evaluate="response_field",
)

print(f"Evaluation started: {response.workflow_id}")

# Check status
status = client.evaluation.status(response.workflow_id)
print(f"Status: {status.status}")
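
A single status check rarely suffices for a long-running job, so callers typically poll until the job finishes. The sketch below shows one way to do that with a generic helper. `wait_for_terminal_status` is a hypothetical function, and the terminal state names are assumptions, since this page does not list the status values the platform returns.

```python
import time
from typing import Callable, Iterable

# ASSUMPTION: illustrative terminal status names; not confirmed by this page.
ASSUMED_TERMINAL_STATES = frozenset({"completed", "error"})


def wait_for_terminal_status(
    fetch_status: Callable[[], str],
    terminal_states: Iterable[str] = ASSUMED_TERMINAL_STATES,
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status repeatedly until it returns a terminal state."""
    terminal = {state.lower() for state in terminal_states}
    deadline = time.monotonic() + timeout
    while True:
        current = fetch_status()
        if current.lower() in terminal:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {current!r} after {timeout} seconds")
        sleep(poll_interval)
```

With the client from the example above, `fetch_status` could be `lambda: client.evaluation.status(response.workflow_id).status`, provided that attribute is a plain string. If it is an enum, pass its string value instead.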

Score Evaluation

response = client.evaluation.create(
    type="score",
    judge_model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    judge_model_source="serverless",
    judge_system_template="Score the quality of this response from 1 to 10.",
    input_data_file_path="file-abc123",
    min_score=1.0,
    max_score=10.0,
    pass_threshold=7.0,
    model_to_evaluate="response_field",
)
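
Because a score evaluation takes three related numbers, a quick consistency check before submission can catch swapped or out-of-range values. This is a client-side sketch only: `check_score_bounds` is a hypothetical helper, and it assumes that `pass_threshold` is meant to lie within `[min_score, max_score]`, which this page does not state explicitly.

```python
def check_score_bounds(min_score: float, max_score: float, pass_threshold: float) -> None:
    """Raise ValueError if the score range or threshold looks inconsistent."""
    if min_score >= max_score:
        raise ValueError(
            f"min_score ({min_score}) must be less than max_score ({max_score})"
        )
    # ASSUMPTION: the passing threshold should fall inside the score range.
    if not (min_score <= pass_threshold <= max_score):
        raise ValueError(
            f"pass_threshold ({pass_threshold}) is outside [{min_score}, {max_score}]"
        )
```

For the values used above, `check_score_bounds(1.0, 10.0, 7.0)` returns without raising.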

Compare Evaluation

response = client.evaluation.create(
    type="compare",
    judge_model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    judge_model_source="serverless",
    judge_system_template="Compare responses A and B. Which is better?",
    input_data_file_path="file-abc123",
    model_a="response_a_field",
    model_b="response_b_field",
)
