Implementation:Togethercomputer Together python Evaluation Resource

Knowledge Sources	Together Python
Domains	Evaluation, LLM
Last Updated	2026-02-15 16:00 GMT

Overview

Concrete tool, provided by the Together Python SDK, for creating and managing LLM evaluation jobs on the Together AI platform.

Description

The Evaluation class provides API methods for running LLM-as-a-judge evaluations. It supports three evaluation types: classify (categorical labeling), score (numeric scoring with thresholds), and compare (pairwise model comparison). Each type uses a configurable judge model to evaluate model outputs against input data. Both synchronous (Evaluation) and asynchronous (AsyncEvaluation) variants are provided.

Usage

Import this class when you need to programmatically evaluate LLM outputs using judge models, compare two models head-to-head, or classify/score model responses against a dataset.

Code Reference

Source Location

Repository: Together Python
File: src/together/resources/evaluation.py
Lines: 1-808

Signature

class Evaluation:
    def __init__(self, client: TogetherClient) -> None: ...

    def create(
        self,
        type: str,
        judge_model: str,
        judge_model_source: str,
        judge_system_template: str,
        input_data_file_path: str,
        judge_external_api_token: Optional[str] = None,
        judge_external_base_url: Optional[str] = None,
        labels: Optional[List[str]] = None,
        pass_labels: Optional[List[str]] = None,
        min_score: Optional[float] = None,
        max_score: Optional[float] = None,
        pass_threshold: Optional[float] = None,
        model_a: Optional[Union[str, Dict[str, Any]]] = None,
        model_b: Optional[Union[str, Dict[str, Any]]] = None,
        model_to_evaluate: Optional[Union[str, Dict[str, Any]]] = None,
    ) -> EvaluationCreateResponse: ...

    def list(
        self,
        status: Optional[str] = None,
        limit: Optional[int] = None,
    ) -> List[EvaluationJob]: ...

    def retrieve(self, evaluation_id: str) -> EvaluationJob: ...
    def status(self, evaluation_id: str) -> EvaluationStatusResponse: ...

Import

from together import Together

client = Together()
# Access via client.evaluation

I/O Contract

Inputs

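Name	Type	Required	Description
type	str	Yes	Evaluation type: "classify", "score", or "compare"
judge_model	str	Yes	Name or URL of the judge model
judge_model_source	str	Yes	Source of the judge model: "serverless", "dedicated", or "external"
judge_system_template	str	Yes	System prompt template for the judge
input_data_file_path	str	Yes	Input data file on the platform (the examples pass a file ID such as "file-abc123")
judge_external_api_token	Optional[str]	No	API token for the judge model; by its name, intended for use when judge_model_source is "external"
judge_external_base_url	Optional[str]	No	Base URL for the judge model; by its name, intended for use when judge_model_source is "external"
labels	List[str]	Yes (classify)	Classification label options
pass_labels	List[str]	Yes (classify)	Labels that count as passing
min_score	float	Yes (score)	Minimum score boundary
max_score	float	Yes (score)	Maximum score boundary
pass_threshold	float	Yes (score)	Score threshold for passing
model_a	Union[str, Dict]	Yes (compare)	First model for comparison
model_b	Union[str, Dict]	Yes (compare)	Second model for comparison
model_to_evaluate	Union[str, Dict]	No (per signature)	Target of the evaluation; passed in the classify and score examples as a string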
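
The type-specific requirements in the table can be checked on the client side before a job is submitted. The following is a minimal sketch, not part of the SDK: REQUIRED_BY_TYPE and missing_params are hypothetical names, and the mapping only restates the "Required" column above.

```python
from typing import Any, Dict, List

# Type-specific parameters, restating the "Required" column of the Inputs table.
# This mapping is illustrative and is not taken from the SDK source.
REQUIRED_BY_TYPE: Dict[str, List[str]] = {
    "classify": ["labels", "pass_labels"],
    "score": ["min_score", "max_score", "pass_threshold"],
    "compare": ["model_a", "model_b"],
}


def missing_params(eval_type: str, params: Dict[str, Any]) -> List[str]:
    """Return the type-specific parameters that are absent or None."""
    if eval_type not in REQUIRED_BY_TYPE:
        raise ValueError(f"Unknown evaluation type: {eval_type!r}")
    return [name for name in REQUIRED_BY_TYPE[eval_type] if params.get(name) is None]


if __name__ == "__main__":
    # A score evaluation without pass_threshold is reported as incomplete.
    print(missing_params("score", {"min_score": 1.0, "max_score": 10.0}))
    # prints ['pass_threshold']
```

A caller could run this check on the keyword arguments it is about to pass to create() and raise early if the returned list is non-empty.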

Outputs

Name	Type	Description
create() returns	EvaluationCreateResponse	Contains workflow_id and initial status
list() returns	List[EvaluationJob]	List of evaluation jobs with status and parameters
retrieve() returns	EvaluationJob	Full evaluation job details
status() returns	EvaluationStatusResponse	Current status and results of an evaluation

Usage Examples

Classify Evaluation

from together import Together

client = Together()

# Create a classify evaluation
response = client.evaluation.create(
    type="classify",
    judge_model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    judge_model_source="serverless",
    judge_system_template="Classify the following response as helpful or unhelpful.",
    input_data_file_path="file-abc123",
    labels=["helpful", "unhelpful"],
    pass_labels=["helpful"],
    model_to_evaluate="response_field",
)

print(f"Evaluation started: {response.workflow_id}")

# Check status
status = client.evaluation.status(response.workflow_id)
print(f"Status: {status.status}")
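
Since create() returns a workflow_id and an initial status rather than results, a caller will usually poll status() until the job finishes. A minimal sketch of such a loop follows. wait_for_evaluation is a hypothetical helper, not an SDK function, and the terminal status names are assumptions that should be checked against the values the API actually returns.

```python
import time
from typing import Callable, Collection


def wait_for_evaluation(
    fetch_status: Callable[[], str],
    terminal_statuses: Collection[str] = ("completed", "error"),  # assumed names
    poll_interval: float = 5.0,
    timeout: float = 600.0,
) -> str:
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        current = fetch_status()
        if current in terminal_statuses:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Evaluation still {current!r} after {timeout} seconds")
        time.sleep(poll_interval)
```

Here fetch_status could wrap client.evaluation.status(response.workflow_id).status from the example above, converted to whatever representation matches terminal_statuses. Passing a callable keeps the loop independent of the SDK, so it can be exercised with a stub.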

Score Evaluation

from together import Together

client = Together()

# Create a score evaluation
response = client.evaluation.create(
    type="score",
    judge_model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    judge_model_source="serverless",
    judge_system_template="Score the quality of this response from 1 to 10.",
    input_data_file_path="file-abc123",
    min_score=1.0,
    max_score=10.0,
    pass_threshold=7.0,
    model_to_evaluate="response_field",
)
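
The three numeric parameters only make sense together if the threshold lies inside the score range. As a sketch of a client-side sanity check (check_score_bounds is a hypothetical helper; whether the SDK or the API enforces the same rule is not stated here):

```python
def check_score_bounds(min_score: float, max_score: float, pass_threshold: float) -> None:
    """Raise ValueError unless min_score < max_score and the threshold is in range."""
    if min_score >= max_score:
        raise ValueError(f"min_score ({min_score}) must be below max_score ({max_score})")
    if not min_score <= pass_threshold <= max_score:
        raise ValueError(
            f"pass_threshold ({pass_threshold}) must lie in [{min_score}, {max_score}]"
        )


# Values from the score example above; the check passes silently.
check_score_bounds(1.0, 10.0, 7.0)
```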

Compare Evaluation

from together import Together

client = Together()

# Create a compare evaluation
response = client.evaluation.create(
    type="compare",
    judge_model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    judge_model_source="serverless",
    judge_system_template="Compare responses A and B. Which is better?",
    input_data_file_path="file-abc123",
    model_a="response_a_field",
    model_b="response_b_field",
)
