Implementation:Togethercomputer Together python Endpoints Resource
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Model_Deployment |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
A concrete tool, provided by the Together Python SDK, for managing dedicated inference endpoints on the Together AI platform.
Description
The `Endpoints` class is the API resource for creating, listing, retrieving, updating, and deleting dedicated inference endpoints. It supports autoscaling configuration, hardware selection, availability zone selection, and endpoint lifecycle control (starting and stopping). Both a synchronous (`Endpoints`) and an asynchronous (`AsyncEndpoints`) variant are provided.
⚠️ Deprecation Notice: The `disable_prompt_cache` parameter (CLI flag: `--no-prompt-cache`) in `create()` and `update()` is deprecated and will be removed in a future version.
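One way to prepare for the removal is to strip the deprecated argument before it reaches the SDK. The following is a minimal sketch: `drop_deprecated_kwargs` and `DEPRECATED_PARAMS` are hypothetical names, not part of the Together SDK, and the helper only operates on a plain keyword dictionary.

```python
import warnings
from typing import Any, Dict

# Parameter named in the deprecation notice above.
DEPRECATED_PARAMS = ("disable_prompt_cache",)


def drop_deprecated_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs without deprecated endpoint parameters.

    Hypothetical helper, not part of the Together SDK.
    """
    cleaned = dict(kwargs)
    for name in DEPRECATED_PARAMS:
        if name in cleaned:
            cleaned.pop(name)
            # Surface the deprecation at the caller's line, not here.
            warnings.warn(
                f"{name} is deprecated and was dropped from the call",
                DeprecationWarning,
                stacklevel=2,
            )
    return cleaned
```

A call site could then pass `**drop_deprecated_kwargs(params)` to `client.endpoints.create`, so existing configuration dictionaries keep working after the parameter is removed.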
Usage
Import this class when you need to programmatically manage dedicated model deployment endpoints, including creating new endpoints with specific hardware, adjusting autoscaling parameters, or querying available hardware configurations.
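For the asynchronous variant, a sketch along the following lines may be useful. It assumes that the async client exposes an `endpoints` attribute whose methods are awaitable and take the same arguments as their synchronous counterparts; the helper names `list_my_dedicated_endpoints` and `fetch_many` are illustrative only.

```python
import asyncio
from typing import Any, List


async def list_my_dedicated_endpoints(client: Any) -> List[Any]:
    """List the caller's dedicated endpoints using an async client.

    Assumes `client.endpoints.list` is awaitable and accepts the same
    arguments as the synchronous `Endpoints.list`.
    """
    return await client.endpoints.list(type="dedicated", mine=True)


async def fetch_many(client: Any, endpoint_ids: List[str]) -> List[Any]:
    """Fetch several endpoints concurrently instead of one by one."""
    return await asyncio.gather(
        *(client.endpoints.get(endpoint_id) for endpoint_id in endpoint_ids)
    )
```

With the real SDK, `client` would be an async client instance (assumed here to be `AsyncTogether` from the `together` package), and the coroutines would be driven with `asyncio.run(...)`.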
Code Reference
Source Location
- Repository: Together Python
- File: src/together/resources/endpoints.py
- Lines: 1-612
Signature
class Endpoints:
    def __init__(self, client: TogetherClient) -> None: ...
    def list(
        self,
        type: Optional[Literal["dedicated", "serverless"]] = None,
        usage_type: Optional[Literal["on-demand", "reserved"]] = None,
        mine: Optional[bool] = None,
    ) -> List[ListEndpoint]: ...
    def create(
        self,
        *,
        model: str,
        hardware: str,
        min_replicas: int,
        max_replicas: int,
        display_name: Optional[str] = None,
        disable_prompt_cache: bool = True,
        disable_speculative_decoding: bool = True,
        state: Literal["STARTED", "STOPPED"] = "STARTED",
        inactive_timeout: Optional[int] = None,
        availability_zone: Optional[str] = None,
    ) -> DedicatedEndpoint: ...
    def get(self, endpoint_id: str) -> DedicatedEndpoint: ...
    def delete(self, endpoint_id: str) -> None: ...
    def update(
        self,
        endpoint_id: str,
        *,
        min_replicas: Optional[int] = None,
        max_replicas: Optional[int] = None,
        state: Optional[Literal["STARTED", "STOPPED"]] = None,
        display_name: Optional[str] = None,
        inactive_timeout: Optional[int] = None,
    ) -> DedicatedEndpoint: ...
    def list_hardware(self, model: Optional[str] = None) -> List[HardwareWithStatus]: ...
    def list_avzones(self) -> List[str]: ...
Import
from together import Together
client = Together()
# Access via client.endpoints
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes (create) | The model to deploy on the endpoint |
| hardware | str | Yes (create) | Hardware configuration ID (e.g., "1x_nvidia_h100_80gb_sxm") |
| min_replicas | int | Yes (create) | Minimum number of replicas to maintain |
| max_replicas | int | Yes (create) | Maximum number of replicas to scale to |
| endpoint_id | str | Yes (get/update/delete) | Unique identifier of the endpoint |
| type | Literal["dedicated", "serverless"] | No | Filter endpoints by type |
| state | Literal["STARTED", "STOPPED"] | No | Desired endpoint state |
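| inactive_timeout | int | No | Minutes of inactivity before auto-stop (0 to disable) |
| usage_type | Literal["on-demand", "reserved"] | No (list) | Filter endpoints by usage type |
| mine | bool | No (list) | If true, list only your own endpoints |
| display_name | str | No | Human-readable name for the endpoint |
| disable_speculative_decoding | bool | No (create) | Whether to disable speculative decoding |
| availability_zone | str | No (create) | Availability zone in which to start the endpoint |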
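The constraints implied by the inputs table can be checked before any request is sent. The sketch below is a hypothetical client-side helper (`validate_endpoint_config` is not an SDK function), and the rule that `max_replicas` must be at least `min_replicas` is an assumption rather than documented platform behavior.

```python
from typing import Optional


def validate_endpoint_config(
    min_replicas: int,
    max_replicas: int,
    inactive_timeout: Optional[int] = None,
) -> None:
    """Raise ValueError for configurations that look inconsistent.

    Client-side checks only; the platform may apply further rules.
    """
    if min_replicas < 0:
        raise ValueError("min_replicas must not be negative")
    if max_replicas < min_replicas:
        # Assumption: the autoscaling range must be non-empty.
        raise ValueError("max_replicas must be >= min_replicas")
    if inactive_timeout is not None and inactive_timeout < 0:
        raise ValueError("inactive_timeout must be >= 0 (0 disables auto-stop)")
```

Calling this before `create()` or `update()` turns an avoidable round trip into an immediate, descriptive error.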
Outputs
| Name | Type | Description |
|---|---|---|
| list() returns | List[ListEndpoint] | List of endpoint summary objects |
| create() returns | DedicatedEndpoint | Created endpoint details with ID, state, autoscaling config |
| get() returns | DedicatedEndpoint | Full endpoint details |
| update() returns | DedicatedEndpoint | Updated endpoint details |
| delete() returns | None | No return value |
| list_hardware() returns | List[HardwareWithStatus] | Available hardware configs with pricing and availability |
| list_avzones() returns | List[str] | Available deployment zones |
Usage Examples
Create a Dedicated Endpoint
from together import Together
client = Together()
# Create a dedicated endpoint with autoscaling
endpoint = client.endpoints.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    hardware="1x_nvidia_h100_80gb_sxm",
    min_replicas=1,
    max_replicas=3,
    display_name="My Llama Endpoint",
    state="STARTED",
)
print(f"Endpoint ID: {endpoint.id}")
print(f"State: {endpoint.state}")
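The state printed above may not yet be the requested one if provisioning takes time. A minimal polling sketch follows, assuming that `get()` eventually reports a `state` equal to the requested value; `wait_for_state`, the timeout, and the polling interval are illustrative choices, not SDK features.

```python
import time
from typing import Any, Callable


def wait_for_state(
    client: Any,
    endpoint_id: str,
    target: str = "STARTED",
    timeout_s: float = 600.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll `client.endpoints.get` until the endpoint reports `target`.

    Raises TimeoutError if the state is not reached within `timeout_s`.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        endpoint = client.endpoints.get(endpoint_id)
        if endpoint.state == target:
            return endpoint
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{endpoint_id} is {endpoint.state!r}, expected {target!r}"
            )
        sleep(poll_interval_s)
```

The `sleep` parameter exists so that tests can substitute a no-op; in normal use, `wait_for_state(client, endpoint.id)` is enough.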
List and Manage Endpoints
# List only your dedicated endpoints
my_endpoints = client.endpoints.list(type="dedicated", mine=True)
# Get details for a specific endpoint
details = client.endpoints.get("endpoint-abc123")
# Update autoscaling
updated = client.endpoints.update(
    "endpoint-abc123",
    min_replicas=2,
    max_replicas=5,
)
# Stop an endpoint
client.endpoints.update("endpoint-abc123", state="STOPPED")
# Delete an endpoint
client.endpoints.delete("endpoint-abc123")
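For short-lived workloads, creation and deletion can be paired so that an endpoint is not left running by accident. The context manager below is a hypothetical helper built only on `create()` and `delete()` as shown above; `temporary_endpoint` is not part of the SDK.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def temporary_endpoint(client: Any, **create_kwargs: Any) -> Iterator[Any]:
    """Create an endpoint and delete it when the block exits.

    Hypothetical helper; `create_kwargs` are passed to `endpoints.create`.
    """
    endpoint = client.endpoints.create(**create_kwargs)
    try:
        yield endpoint
    finally:
        # Runs on normal exit and on exceptions alike.
        client.endpoints.delete(endpoint.id)
```

It would be used as `with temporary_endpoint(client, model=..., hardware=..., min_replicas=1, max_replicas=1) as ep:`, with the workload running inside the block.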
Query Hardware and Availability Zones
# List hardware compatible with a specific model
hardware = client.endpoints.list_hardware(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
for hw in hardware:
    print(f"{hw.id}: {hw.specs.gpu_type} x{hw.specs.gpu_count} - ${hw.pricing.cents_per_minute/100:.2f}/min")
# List availability zones
zones = client.endpoints.list_avzones()
print(f"Available zones: {zones}")
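The hardware list can also drive a simple selection step. The sketch below picks the lowest-priced option and converts its price to dollars per hour, relying only on the `pricing.cents_per_minute` attribute used in the example above; `cheapest_hardware` and `hourly_cost_usd` are hypothetical helpers, and the selection ignores availability.

```python
from typing import Any, Iterable, Optional


def cheapest_hardware(options: Iterable[Any]) -> Optional[Any]:
    """Return the option with the lowest per-minute price, or None if empty."""
    return min(options, key=lambda hw: hw.pricing.cents_per_minute, default=None)


def hourly_cost_usd(hw: Any) -> float:
    """Convert a per-minute price in cents to dollars per hour."""
    return hw.pricing.cents_per_minute * 60 / 100
```

For example, `cheapest_hardware(client.endpoints.list_hardware(model=...))` would return one entry whose `id` could then be passed as `hardware` to `create()`.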