Implementation:Togethercomputer Together python Evaluation CLI
| Knowledge Sources | |
|---|---|
| Domains | CLI, Evaluation |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
A concrete CLI tool, provided by the Together Python SDK, for creating and managing LLM evaluation jobs from the command line.
Description
The evaluation Click command group provides terminal commands for creating evaluation jobs (classify, score, compare), listing jobs as formatted tables, and retrieving job details and status. It supports two ways of supplying model output: a simple field-reference mode, where responses are pre-generated and already present in the input data, and a detailed model configuration mode, where responses are generated on the fly.
Usage
Use these CLI commands when managing LLM evaluations from a terminal or shell script rather than Python code.
Code Reference
Source Location
- Repository: Together Python
- File: src/together/cli/api/evaluation.py
- Lines: 1-479
Signature
together evaluation create --type classify|score|compare --judge-model MODEL --judge-model-source SOURCE --judge-system-template TEMPLATE --input-data-file-path PATH [options...]
together evaluation list [--status STATUS] [--limit N]
together evaluation retrieve EVALUATION_ID
together evaluation status EVALUATION_ID
Import
together evaluation <subcommand>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --type | Choice | Yes | Evaluation type: classify, score, or compare |
| --judge-model | str | Yes | Judge model name or URL |
| --judge-model-source | Choice | Yes | Source: serverless, dedicated, or external |
| --judge-system-template | str | Yes | System template for the judge |
| --input-data-file-path | str | Yes | Path to input data file |
| --labels | str | Yes (classify) | Comma-separated classification labels |
| --pass-labels | str | Yes (classify) | Comma-separated passing labels |
| --min-score | float | Yes (score) | Minimum score boundary |
| --max-score | float | Yes (score) | Maximum score boundary |
| --pass-threshold | float | Yes (score) | Passing threshold |
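The type-specific requirements in the table above can be checked before the CLI is invoked. The following is a minimal sketch, not part of the SDK: `build_create_argv` and `REQUIRED_BY_TYPE` are hypothetical names introduced for illustration, and the mapping only encodes what the Inputs table states (no type-specific options are listed there for compare). Any extra keyword argument is passed through as a `--kebab-case` flag without being checked against the real CLI.

```python
import shlex

# Type-specific options as listed in the Inputs table.
# Assumption: "compare" has no extra required options, since the table lists none.
REQUIRED_BY_TYPE = {
    "classify": ("labels", "pass_labels"),
    "score": ("min_score", "max_score", "pass_threshold"),
    "compare": (),
}


def build_create_argv(eval_type, judge_model, judge_model_source,
                      judge_system_template, input_data_file_path, **extra):
    """Return an argv list for `together evaluation create` (hypothetical helper)."""
    if eval_type not in REQUIRED_BY_TYPE:
        raise ValueError(f"unknown evaluation type: {eval_type!r}")

    # Fail early if an option required for this evaluation type is missing.
    missing = [name for name in REQUIRED_BY_TYPE[eval_type]
               if extra.get(name) is None]
    if missing:
        flags = ", ".join("--" + name.replace("_", "-") for name in missing)
        raise ValueError(f"--type {eval_type} requires {flags}")

    argv = [
        "together", "evaluation", "create",
        "--type", eval_type,
        "--judge-model", judge_model,
        "--judge-model-source", judge_model_source,
        "--judge-system-template", judge_system_template,
        "--input-data-file-path", input_data_file_path,
    ]
    # Remaining options become --kebab-case flags in the order they were given.
    for name, value in extra.items():
        if value is not None:
            argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


if __name__ == "__main__":
    argv = build_create_argv(
        "classify",
        "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "serverless",
        "Classify this response.",
        "file-abc123",
        model_field="response",
        labels="good,bad",
        pass_labels="good",
    )
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(argv))
```

Returning a list rather than a single string keeps the result usable with `subprocess.run` without extra quoting, while `shlex.join` gives a copy-pasteable form for logging.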
Outputs
| Name | Type | Description |
|---|---|---|
| create output | JSON | Evaluation job creation response with workflow_id |
| list output | Table | Formatted table of evaluation jobs |
| retrieve output | JSON | Full evaluation job details |
| status output | JSON | Current status and results |
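Because the create output is JSON containing a workflow_id, a script can capture that value and feed it to the status command. The sketch below assumes the CLI prints a single JSON object on standard output; that is an assumption, not something confirmed here. `extract_workflow_id` and `status_argv` are hypothetical helpers, and the sample identifier is made up.

```python
import json


def extract_workflow_id(create_stdout):
    """Return the workflow_id from the JSON text printed by the create command.

    Assumption: the output is one JSON object with a "workflow_id" key.
    """
    payload = json.loads(create_stdout)
    workflow_id = payload.get("workflow_id") if isinstance(payload, dict) else None
    if not workflow_id:
        raise ValueError("no workflow_id found in create output")
    return workflow_id


def status_argv(workflow_id):
    """Build the argv list for `together evaluation status <id>`."""
    return ["together", "evaluation", "status", workflow_id]


if __name__ == "__main__":
    # Hypothetical sample output; real responses may carry more fields.
    sample = '{"workflow_id": "eval-abc123"}'
    print(status_argv(extract_workflow_id(sample)))
```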
Usage Examples
# Create a classify evaluation
together evaluation create \
--type classify \
--judge-model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--judge-model-source serverless \
--judge-system-template "Classify this response." \
--input-data-file-path file-abc123 \
--model-field response \
--labels "good,bad" \
--pass-labels "good"
# List evaluation jobs
together evaluation list --limit 10
# Check status (pass the workflow_id returned by create)
together evaluation status WORKFLOW_ID