Implementation:Pytorch Serve Management API

From Leeroopedia

Overview

Management API is the REST-based control plane in TorchServe for dynamic model lifecycle management. It runs on port 8081 by default and provides endpoints for registering, scaling, describing, and unregistering models on a running server. TorchServe also provides Python helper functions (register_model and register_model_with_params) in the ts.launcher module for programmatic access.

Field                Value
Implementation Name  Management API
Type                 External Tool Doc
Workflow             Model_Deployment
Domains              Model_Serving, API_Design
Knowledge Sources    TorchServe
Last Updated         2026-02-13 00:00 GMT

Description

The Management API is served by the Java frontend (Netty server) and provides RESTful endpoints following the gRPC ManagementAPIsService proto definition. It requires the --enable-model-api flag to be set at server startup for model registration to work.

Endpoints

Endpoint                        Method  Description                           Default Port
/models                         POST    Register a new model                  8081
/models                         GET     List all registered models            8081
/models/{model_name}            GET     Describe model status and workers     8081
/models/{model_name}            PUT     Scale workers for a model             8081
/models/{model_name}            DELETE  Unregister a model                    8081
/models/{model_name}/{version}  PUT     Scale workers for a specific version  8081
/models/{model_name}/{version}  DELETE  Unregister a specific version         8081
/models/{model_name}/all        GET     Describe all versions of a model      8081

Python Helper Functions

The ts.launcher module provides two convenience functions that wrap HTTP calls to the Management API:

  • register_model(model_name, url): Registers a model with 1 initial worker synchronously.
  • register_model_with_params(params): Registers a model with arbitrary parameters.
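Under the hood, both helpers issue a plain HTTP POST against the Management API. As a rough sketch (not the actual `ts.launcher` internals, which may differ), the URL that `register_model` effectively issues can be built with the standard library:

```python
from urllib.parse import urlencode

def build_register_url(model_name, url, base="http://localhost:8081"):
    """Build the POST /models URL that register_model roughly issues:
    one initial worker, registered synchronously."""
    params = {
        "model_name": model_name,
        "url": url,
        "initial_workers": 1,
        "synchronous": "true",
    }
    return f"{base}/models?{urlencode(params)}"

# e.g. build_register_url("squeezenet", "squeezenet1_1.mar")
```

Any HTTP client (curl, `requests`) can then POST to the resulting URL.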

Usage

from ts.launcher import register_model, register_model_with_params

Or use curl / requests directly against the REST endpoints.

Code Reference

Source Location

File                    Lines     Description              Repository
docs/management_api.md  L27-188   REST API documentation   pytorch/serve
ts/launcher.py          L107-119  Python helper functions  pytorch/serve

Signature

REST Endpoints

POST /models
  Parameters:
    url (str):              Required. Model archive URL or local .mar filename.
    model_name (str):       Optional. Override model name from manifest.
    handler (str):          Optional. Override handler from manifest.
    runtime (str):          Optional. Runtime type. Default: "PYTHON".
    batch_size (int):       Optional. Inference batch size. Default: 1.
    max_batch_delay (int):  Optional. Max batch wait in ms. Default: 100.
    initial_workers (int):  Optional. Initial worker count. Default: 0.
    synchronous (bool):     Optional. Wait for workers. Default: false.
    response_timeout (int): Optional. Worker response timeout in seconds. Default: 120.
    startup_timeout (int):  Optional. Model load timeout in seconds. Default: 120.

PUT /models/{model_name}
  Parameters:
    min_worker (int):       Optional. Minimum workers. Default: 1.
    max_worker (int):       Optional. Maximum workers. Default: same as min_worker.
    synchronous (bool):     Optional. Wait for scaling. Default: false.
    timeout (int):          Optional. Worker drain timeout in seconds. Default: -1.
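The scale-workers call is the same pattern with a PUT and the model name (and optionally version) in the path. A minimal sketch of the URL construction, assuming only the parameters documented above:

```python
from urllib.parse import urlencode

def build_scale_url(model_name, min_worker, max_worker=None,
                    synchronous=True, version=None,
                    base="http://localhost:8081"):
    """Build the PUT /models/{name}[/{version}] URL for scaling workers."""
    params = {"min_worker": min_worker,
              "synchronous": str(synchronous).lower()}
    if max_worker is not None:
        params["max_worker"] = max_worker
    path = f"/models/{model_name}" + (f"/{version}" if version else "")
    return f"{base}{path}?{urlencode(params)}"

# e.g. build_scale_url("noop", 3)
# e.g. build_scale_url("noop", 3, version="2.0")
```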

DELETE /models/{model_name}

GET /models/{model_name}

GET /models

Python Helpers

def register_model(model_name: str, url: str) -> requests.Response:
    """
    Register a model with 1 initial worker, synchronous.

    Sends POST to http://localhost:8081/models with params:
      model_name, url, initial_workers=1, synchronous=true

    Args:
        model_name (str): Name for the model.
        url (str): URL or local filename of the .mar archive.

    Returns:
        requests.Response: HTTP response from the Management API.
    """
    ...


def register_model_with_params(params) -> requests.Response:
    """
    Register a model with arbitrary parameters.

    Sends POST to http://localhost:8081/models with the given params.

    Args:
        params: dict or list of tuples of query parameters.

    Returns:
        requests.Response: HTTP response from the Management API.
    """
    ...

Import

from ts.launcher import register_model, register_model_with_params

I/O Contract

POST /models (Register)

Input             Type                 Description
Query parameters  See signature above  Model registration parameters

Code  Condition                           Body
200   Synchronous registration success    {"status": "Model \"{name}\" Version: {ver} registered with {n} initial workers"}
202   Asynchronous registration accepted  {"status": "Processing worker updates..."}
400   Invalid parameters                  Error message
409   Model already registered            Error message
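Since a 202 means registration was accepted but workers are still spinning up, an asynchronous caller typically polls the describe endpoint until every worker reports READY. A minimal sketch, where `fetch_describe` is a stand-in for something like `lambda: requests.get("http://localhost:8081/models/<name>").json()`:

```python
import time

def wait_for_workers(fetch_describe, attempts=30, delay=1.0):
    """Poll a describe-model callable until every worker reports READY.

    Returns True once all workers are READY, False if attempts run out.
    """
    for _ in range(attempts):
        info = fetch_describe()            # list with one entry per version
        workers = info[0].get("workers", [])
        if workers and all(w["status"] == "READY" for w in workers):
            return True
        time.sleep(delay)
    return False
```

Injecting the fetch callable keeps the helper testable without a running server.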

PUT /models/{model_name} (Scale)

Code  Condition                      Body
200   Synchronous scaling success    {"status": "Workers scaled to {n} for model: {name}"}
202   Asynchronous scaling accepted  {"status": "Processing worker updates..."}
404   Model not found                Error message

DELETE /models/{model_name} (Unregister)

Code  Condition           Body
200   Model unregistered  {"status": "Model \"{name}\" unregistered"}
404   Model not found     Error message

GET /models/{model_name} (Describe)

Code  Body fields
200   modelName, modelVersion, modelUrl, runtime, minWorkers, maxWorkers, batchSize,
      maxBatchDelay, workers[] (id, startTime, status, gpu, memoryUsage),
      jobQueueStatus (remainingCapacity, pendingRequests)

register_model()

Parameter   Type  Required  Description
model_name  str   Yes       Name for the model
url         str   Yes       URL or local .mar filename

Return    Type               Description
response  requests.Response  HTTP response from the Management API

Usage Examples

Example 1: Register a model from a remote URL

curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"

# Response (202):
{
  "status": "Model \"squeezenet_v1.1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
}

Example 2: Register with workers synchronously

curl -v -X POST "http://localhost:8081/models?initial_workers=1&synchronous=true&url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"

# Response (200):
{
  "status": "Model \"squeezenet1_1\" Version: 1.0 registered with 1 initial workers"
}

Example 3: Scale workers

# Scale to 3 workers synchronously
curl -v -X PUT "http://localhost:8081/models/noop?min_worker=3&synchronous=true"

# Response (200):
{
  "status": "Workers scaled to 3 for model: noop"
}

# Scale a specific version
curl -v -X PUT "http://localhost:8081/models/noop/2.0?min_worker=3&synchronous=true"

# Response (200):
{
  "status": "Workers scaled to 3 for model: noop, version: 2.0"
}

Example 4: Describe model status

curl http://localhost:8081/models/noop

# Response (200):
[
  {
    "modelName": "noop",
    "modelVersion": "1.0",
    "modelUrl": "noop.mar",
    "engine": "Torch",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "workers": [
      {
        "id": "9000",
        "startTime": "2018-10-02T13:44:53.034Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 89247744
      }
    ],
    "jobQueueStatus": {
      "remainingCapacity": 100,
      "pendingRequests": 0
    }
  }
]
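The describe payload is a JSON list with one entry per model version, so fields such as worker status and memory usage are straightforward to extract. A small sketch against a trimmed copy of the response above:

```python
import json

# Trimmed describe-model payload (same shape as the response above).
describe_body = """
[{"modelName": "noop",
  "workers": [{"id": "9000", "status": "READY",
               "gpu": false, "memoryUsage": 89247744}]}]
"""

info = json.loads(describe_body)
ready = all(w["status"] == "READY" for w in info[0]["workers"])
mem_mb = sum(w["memoryUsage"] for w in info[0]["workers"]) / 2**20
print(ready, round(mem_mb, 1))  # prints: True 85.1
```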

Example 5: Python programmatic registration

from ts.launcher import register_model, register_model_with_params

# Simple registration: 1 worker, synchronous
response = register_model("squeezenet", "squeezenet1_1.mar")
print(response.status_code)  # 200
print(response.json())       # {"status": "Model ..."}

# Registration with custom parameters
params = {
    "model_name": "bert",
    "url": "bert.mar",
    "initial_workers": "4",
    "batch_size": "16",
    "max_batch_delay": "200",
    "synchronous": "true",
    "response_timeout": "300",
}
response = register_model_with_params(params)
print(response.status_code)  # 200

Example 6: Full lifecycle workflow

import requests

BASE = "http://localhost:8081"

# 1. Register model
requests.post(f"{BASE}/models", params={
    "url": "resnet18.mar",
    "initial_workers": "2",
    "synchronous": "true",
})

# 2. Check status
status = requests.get(f"{BASE}/models/resnet18").json()
print(f"Workers: {len(status[0]['workers'])}")

# 3. Scale up for peak traffic
requests.put(f"{BASE}/models/resnet18", params={
    "min_worker": "8",
    "synchronous": "true",
})

# 4. Scale down after peak
requests.put(f"{BASE}/models/resnet18", params={
    "min_worker": "2",
    "synchronous": "true",
})

# 5. Unregister when no longer needed
requests.delete(f"{BASE}/models/resnet18")
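The lifecycle calls above ignore failures; in practice each response code should be checked (e.g. 404 for a missing model, 409 for a duplicate registration). A small hedged helper mapping the codes from the I/O contract above to outcomes:

```python
def mgmt_outcome(status_code):
    """Map Management API response codes to outcomes (per the I/O contract)."""
    outcomes = {
        200: "ok",
        202: "accepted (async operation still in progress)",
        400: "invalid parameters",
        404: "model not found",
        409: "model already registered",
    }
    return outcomes.get(status_code, f"unexpected status {status_code}")

# e.g. mgmt_outcome(response.status_code) after each requests call
```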
