Implementation:Togethercomputer Together python Evaluation Resource
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLM |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
A concrete tool, provided by the Together Python SDK, for creating and managing LLM evaluation jobs on the Together AI platform.
Description
The Evaluation class provides API methods for running LLM-as-a-judge evaluations. It supports three evaluation types: classify (categorical labeling), score (numeric scoring with thresholds), and compare (pairwise model comparison). Each type uses a configurable judge model to evaluate model outputs against input data. Both synchronous (Evaluation) and asynchronous (AsyncEvaluation) variants are provided.
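The split between the three types can be made concrete with a small pre-flight check. The sketch below is not part of the SDK: `REQUIRED_BY_TYPE` and `missing_params` are hypothetical names, and the required-parameter sets are copied from the Inputs table on this page, so the SDK's own validation may be stricter or differ.

```python
from typing import Any, Dict, List

# Type-specific parameters as listed in this page's Inputs table.
# Hypothetical helper: the SDK may apply its own, different validation.
REQUIRED_BY_TYPE: Dict[str, List[str]] = {
    "classify": ["labels", "pass_labels"],
    "score": ["min_score", "max_score", "pass_threshold"],
    "compare": ["model_a", "model_b"],
}


def missing_params(eval_type: str, params: Dict[str, Any]) -> List[str]:
    """Return the type-specific parameters that are absent or set to None."""
    if eval_type not in REQUIRED_BY_TYPE:
        raise ValueError(f"unknown evaluation type: {eval_type!r}")
    return [
        name
        for name in REQUIRED_BY_TYPE[eval_type]
        if params.get(name) is None
    ]
```

Calling `missing_params` on the keyword arguments before passing them to `create()` surfaces an incomplete configuration locally, without a round trip to the platform.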
Usage
Use this resource (accessed as client.evaluation) when you need to programmatically evaluate LLM outputs with a judge model, compare two models head-to-head, or classify or score model responses against a dataset.
Code Reference
Source Location
- Repository: Together Python
- File: src/together/resources/evaluation.py
- Lines: 1-808
Signature
class Evaluation:
    def __init__(self, client: TogetherClient) -> None: ...

    def create(
        self,
        type: str,
        judge_model: str,
        judge_model_source: str,
        judge_system_template: str,
        input_data_file_path: str,
        judge_external_api_token: Optional[str] = None,
        judge_external_base_url: Optional[str] = None,
        labels: Optional[List[str]] = None,
        pass_labels: Optional[List[str]] = None,
        min_score: Optional[float] = None,
        max_score: Optional[float] = None,
        pass_threshold: Optional[float] = None,
        model_a: Optional[Union[str, Dict[str, Any]]] = None,
        model_b: Optional[Union[str, Dict[str, Any]]] = None,
        model_to_evaluate: Optional[Union[str, Dict[str, Any]]] = None,
    ) -> EvaluationCreateResponse: ...

    def list(
        self,
        status: Optional[str] = None,
        limit: Optional[int] = None,
    ) -> List[EvaluationJob]: ...

    def retrieve(self, evaluation_id: str) -> EvaluationJob: ...

    def status(self, evaluation_id: str) -> EvaluationStatusResponse: ...
Import
from together import Together
client = Together()
# Access via client.evaluation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| type | str | Yes | Evaluation type: "classify", "score", or "compare" |
| judge_model | str | Yes | Name or URL of the judge model |
| judge_model_source | str | Yes | Source: "serverless", "dedicated", or "external" |
| judge_system_template | str | Yes | System prompt template for the judge |
| input_data_file_path | str | Yes | Path to input data file on the platform |
| labels | List[str] | Yes (classify) | Classification label options |
| pass_labels | List[str] | Yes (classify) | Labels that count as passing |
| min_score | float | Yes (score) | Minimum score boundary |
| max_score | float | Yes (score) | Maximum score boundary |
| pass_threshold | float | Yes (score) | Score threshold for passing |
| model_a | Union[str, Dict] | Yes (compare) | First model for comparison |
| model_b | Union[str, Dict] | Yes (compare) | Second model for comparison |
Outputs
| Name | Type | Description |
|---|---|---|
| create() returns | EvaluationCreateResponse | Contains workflow_id and initial status |
| list() returns | List[EvaluationJob] | List of evaluation jobs with status and parameters |
| retrieve() returns | EvaluationJob | Full evaluation job details |
| status() returns | EvaluationStatusResponse | Current status and results of an evaluation |
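As a rough illustration of working with the list() output, the sketch below tallies jobs by status. `count_by_status` is a hypothetical helper, not an SDK function, and it assumes each EvaluationJob exposes a `status` attribute whose string form is meaningful; if the SDK uses an enum there, the keys will be whatever `str()` produces for it.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_by_status(jobs: Iterable[Any]) -> Dict[str, int]:
    """Tally evaluation jobs by the string form of their status attribute.

    Assumes each job object has a `status` attribute (an assumption based
    on the Outputs table, not confirmed against the SDK's type definitions).
    """
    return dict(Counter(str(job.status) for job in jobs))
```

With a configured client, this would be called as `count_by_status(client.evaluation.list(limit=50))`.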
Usage Examples
Classify Evaluation
from together import Together
client = Together()
# Create a classify evaluation
response = client.evaluation.create(
    type="classify",
    judge_model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    judge_model_source="serverless",
    judge_system_template="Classify the following response as helpful or unhelpful.",
    input_data_file_path="file-abc123",
    labels=["helpful", "unhelpful"],
    pass_labels=["helpful"],
    model_to_evaluate="response_field",
)
print(f"Evaluation started: {response.workflow_id}")
# Check status
status = client.evaluation.status(response.workflow_id)
print(f"Status: {status.status}")
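A single status check only reports the state at that moment, so a job that is still running usually needs to be polled. The following is a minimal sketch of such a loop, not an SDK feature: `wait_for_evaluation` is a hypothetical helper, and the terminal status names ("completed", "error") are assumptions that should be checked against the values the platform actually returns.

```python
import time
from typing import Any, Callable, Collection


def wait_for_evaluation(
    get_status: Callable[[], Any],
    terminal: Collection[str] = ("completed", "error"),  # assumed names
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll get_status() until its .status is terminal, else raise TimeoutError."""
    for _ in range(max_polls):
        result = get_status()
        if str(result.status) in terminal:
            return result
        # Not finished yet: wait before asking again.
        sleep(interval_s)
    raise TimeoutError(f"evaluation not finished after {max_polls} polls")
```

With the client from the example above, this could be called as `wait_for_evaluation(lambda: client.evaluation.status(response.workflow_id))`. Passing the status call as a callable keeps the helper independent of the SDK, and the injectable `sleep` makes it easy to exercise without real delays.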
Score Evaluation
# Create a score evaluation (reuses the client from the classify example)
response = client.evaluation.create(
    type="score",
    judge_model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    judge_model_source="serverless",
    judge_system_template="Score the quality of this response from 1 to 10.",
    input_data_file_path="file-abc123",
    min_score=1.0,
    max_score=10.0,
    pass_threshold=7.0,
    model_to_evaluate="response_field",
)
Compare Evaluation
# Create a compare evaluation (reuses the client from the classify example)
response = client.evaluation.create(
    type="compare",
    judge_model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    judge_model_source="serverless",
    judge_system_template="Compare responses A and B. Which is better?",
    input_data_file_path="file-abc123",
    model_a="response_a_field",
    model_b="response_b_field",
)