Implementation:Togethercomputer Together python Endpoints Resource

Knowledge Sources	Together Python
Domains	Infrastructure, Model_Deployment
Last Updated	2026-02-15 16:00 GMT

Overview

A concrete tool, provided by the Together Python SDK, for managing dedicated inference endpoints on the Together AI platform.

Description

The Endpoints class provides a complete API resource for creating, listing, retrieving, updating, and deleting dedicated inference endpoints. It supports autoscaling configuration, hardware selection, availability zone management, and endpoint lifecycle control. Both synchronous (Endpoints) and asynchronous (AsyncEndpoints) variants are provided.
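Since the asynchronous variant mirrors the synchronous one, independent read-only calls can be awaited concurrently. The following is a minimal sketch, assuming an async client (for example the SDK's `AsyncTogether`) whose `endpoints` attribute is an `AsyncEndpoints` instance. The helper name `fetch_inventory` is hypothetical and not part of the SDK.

```python
import asyncio
from typing import Any, Dict


async def fetch_inventory(client: Any) -> Dict[str, Any]:
    """Fetch dedicated endpoints and availability zones concurrently.

    `client` is assumed to be an async client whose `endpoints` attribute
    behaves like AsyncEndpoints (hypothetical helper, not part of the SDK).
    """
    endpoints, zones = await asyncio.gather(
        client.endpoints.list(type="dedicated"),
        client.endpoints.list_avzones(),
    )
    return {"endpoints": endpoints, "zones": zones}


# Example wiring (assumes the SDK is installed and an API key is configured):
#   from together import AsyncTogether
#   inventory = asyncio.run(fetch_inventory(AsyncTogether()))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub, without network access.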

⚠️ Deprecation Notice: The `disable_prompt_cache` parameter (CLI flag: `--no-prompt-cache`) in `create()` and `update()` is deprecated and will be removed in a future version.

Usage

Use this resource (accessed as client.endpoints) when you need to manage dedicated model deployment endpoints programmatically: creating new endpoints on specific hardware, adjusting autoscaling parameters, or querying available hardware configurations.

Code Reference

Source Location

Repository: Together Python
File: src/together/resources/endpoints.py
Lines: 1-612

Signature

class Endpoints:
    def __init__(self, client: TogetherClient) -> None: ...

    def list(
        self,
        type: Optional[Literal["dedicated", "serverless"]] = None,
        usage_type: Optional[Literal["on-demand", "reserved"]] = None,
        mine: Optional[bool] = None,
    ) -> List[ListEndpoint]: ...

    def create(
        self,
        *,
        model: str,
        hardware: str,
        min_replicas: int,
        max_replicas: int,
        display_name: Optional[str] = None,
        disable_prompt_cache: bool = True,
        disable_speculative_decoding: bool = True,
        state: Literal["STARTED", "STOPPED"] = "STARTED",
        inactive_timeout: Optional[int] = None,
        availability_zone: Optional[str] = None,
    ) -> DedicatedEndpoint: ...

    def get(self, endpoint_id: str) -> DedicatedEndpoint: ...
    def delete(self, endpoint_id: str) -> None: ...

    def update(
        self,
        endpoint_id: str,
        *,
        min_replicas: Optional[int] = None,
        max_replicas: Optional[int] = None,
        state: Optional[Literal["STARTED", "STOPPED"]] = None,
        display_name: Optional[str] = None,
        inactive_timeout: Optional[int] = None,
    ) -> DedicatedEndpoint: ...

    def list_hardware(self, model: Optional[str] = None) -> List[HardwareWithStatus]: ...
    def list_avzones(self) -> List[str]: ...

Import

from together import Together

client = Together()
# Access via client.endpoints

I/O Contract

Inputs

Name	Type	Required	Description
model	str	Yes (create)	The model to deploy on the endpoint
hardware	str	Yes (create)	Hardware configuration ID (e.g., "1x_nvidia_h100_80gb_sxm")
min_replicas	int	Yes (create)	Minimum number of replicas to maintain
max_replicas	int	Yes (create)	Maximum number of replicas to scale to
endpoint_id	str	Yes (get/update/delete)	Unique identifier of the endpoint
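type	Literal["dedicated", "serverless"]	No (list)	Filter endpoints by type
usage_type	Literal["on-demand", "reserved"]	No (list)	Filter endpoints by usage type
mine	bool	No (list)	If True, return only endpoints owned by the caller
display_name	str	No (create/update)	Human-readable name for the endpoint
disable_prompt_cache	bool	No (create)	Deprecated; will be removed in a future version
disable_speculative_decoding	bool	No (create)	Whether to disable speculative decoding (defaults to True)
state	Literal["STARTED", "STOPPED"]	No (create/update)	Desired endpoint state (create() defaults to "STARTED")
inactive_timeout	int	No (create/update)	Minutes of inactivity before auto-stop (0 to disable)
availability_zone	str	No (create)	Availability zone to deploy in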
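
The replica bounds and the inactivity timeout interact, so it can help to check them before calling create(). The helper below is a hypothetical client-side check, not part of the SDK. The rules it enforces (non-negative values, `min_replicas <= max_replicas`) are assumptions about sensible input, not documented server-side validation.

```python
from typing import Optional


def check_scaling_args(
    min_replicas: int,
    max_replicas: int,
    inactive_timeout: Optional[int] = None,
) -> None:
    """Raise ValueError if the autoscaling arguments are inconsistent."""
    if min_replicas < 0 or max_replicas < 0:
        raise ValueError("replica counts must be non-negative")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    # 0 disables auto-stop; negative values are assumed to be invalid.
    if inactive_timeout is not None and inactive_timeout < 0:
        raise ValueError("inactive_timeout must be 0 or a positive number of minutes")
```

Calling `check_scaling_args(1, 3, inactive_timeout=0)` returns silently, while `check_scaling_args(3, 1)` raises `ValueError` before any request is sent.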

Outputs

Name	Type	Description
list() returns	List[ListEndpoint]	List of endpoint summary objects
create() returns	DedicatedEndpoint	Created endpoint details with ID, state, autoscaling config
get() returns	DedicatedEndpoint	Full endpoint details
update() returns	DedicatedEndpoint	Updated endpoint details
list_hardware() returns	List[HardwareWithStatus]	Available hardware configs with pricing and availability
list_avzones() returns	List[str]	Available deployment zones

Usage Examples

Create a Dedicated Endpoint

from together import Together

client = Together()

# Create a dedicated endpoint with autoscaling
endpoint = client.endpoints.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    hardware="1x_nvidia_h100_80gb_sxm",
    min_replicas=1,
    max_replicas=3,
    display_name="My Llama Endpoint",
    state="STARTED",
)

print(f"Endpoint ID: {endpoint.id}")
print(f"State: {endpoint.state}")
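
A newly created endpoint may not be ready to serve traffic immediately, so a caller might poll get() until the reported state is "STARTED". The sketch below is a hypothetical helper built only on `get()` and the `state` attribute shown above. The timeout and polling interval are arbitrary defaults, and any intermediate state names the API may report are not assumed.

```python
import time
from typing import Any, Callable


def wait_until_started(
    client: Any,
    endpoint_id: str,
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll get() until the endpoint reports state "STARTED" or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        endpoint = client.endpoints.get(endpoint_id)
        if endpoint.state == "STARTED":
            return endpoint
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{endpoint_id} still in state {endpoint.state!r}")
        sleep(poll_s)
```

With the client from the example above, `wait_until_started(client, endpoint.id)` would block until the endpoint is running. The injectable `sleep` argument exists so the loop can be tested without real delays.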

List and Manage Endpoints

# List only your dedicated endpoints
my_endpoints = client.endpoints.list(type="dedicated", mine=True)

# Get details for a specific endpoint
details = client.endpoints.get("endpoint-abc123")

# Update autoscaling
updated = client.endpoints.update(
    "endpoint-abc123",
    min_replicas=2,
    max_replicas=5,
)

# Stop an endpoint
client.endpoints.update("endpoint-abc123", state="STOPPED")

# Delete an endpoint
client.endpoints.delete("endpoint-abc123")
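
The list and update calls above can be combined, for instance to stop every running endpoint you own at the end of a session. This is a minimal sketch under the assumption that each item returned by list() exposes `id` and `state` attributes. The helper name `stop_running_endpoints` is hypothetical.

```python
from typing import Any, List


def stop_running_endpoints(client: Any) -> List[str]:
    """Stop every dedicated endpoint owned by the caller that is STARTED.

    Returns the IDs of the endpoints that were stopped. Assumes each item
    from list() has `id` and `state` attributes.
    """
    stopped: List[str] = []
    for ep in client.endpoints.list(type="dedicated", mine=True):
        if ep.state == "STARTED":
            client.endpoints.update(ep.id, state="STOPPED")
            stopped.append(ep.id)
    return stopped
```

Stopping rather than deleting keeps the endpoint definition around, so it can be restarted later with `update(..., state="STARTED")`.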

Query Hardware and Availability Zones

# List hardware compatible with a specific model
hardware = client.endpoints.list_hardware(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
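for hw in hardware:
    # Pricing is reported in cents per minute; convert to dollars for display
    dollars_per_minute = hw.pricing.cents_per_minute / 100
    print(f"{hw.id}: {hw.specs.gpu_type} x{hw.specs.gpu_count} - ${dollars_per_minute:.2f}/min")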

# List availability zones
zones = client.endpoints.list_avzones()
print(f"Available zones: {zones}")
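
The hardware list can also drive a selection step, such as picking the cheapest configuration for a model. The sketch below relies only on the `pricing.cents_per_minute` attribute used in the example above. The helper name `cheapest_hardware` is hypothetical, and it deliberately ignores availability, which a real selection would probably also consider.

```python
from typing import Any, Iterable, Optional


def cheapest_hardware(options: Iterable[Any]) -> Optional[Any]:
    """Return the option with the lowest pricing.cents_per_minute.

    Returns None when `options` is empty.
    """
    return min(options, key=lambda hw: hw.pricing.cents_per_minute, default=None)


# Example wiring (assumes `client` is a configured Together client):
#   best = cheapest_hardware(client.endpoints.list_hardware(model="..."))
#   if best is not None:
#       print(best.id)
```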
