Overview
The Management API is TorchServe's REST-based control plane for dynamic model lifecycle management. It listens on port 8081 by default and exposes endpoints for registering, scaling, describing, and unregistering models on a running server. TorchServe also provides Python helper functions (`register_model` and `register_model_with_params`) in the `ts.launcher` module for programmatic access.
Description
The Management API is served by the Java frontend (a Netty server) and exposes RESTful endpoints that follow the gRPC `ManagementAPIsService` proto definition. Model registration requires the server to be started with the `--enable-model-api` flag.
Endpoints
| Endpoint | Method | Description | Default Port |
|----------|--------|-------------|--------------|
| /models | POST | Register a new model | 8081 |
| /models | GET | List all registered models | 8081 |
| /models/{model_name} | GET | Describe model status and workers | 8081 |
| /models/{model_name} | PUT | Scale workers for a model | 8081 |
| /models/{model_name} | DELETE | Unregister a model | 8081 |
| /models/{model_name}/{version} | PUT | Scale workers for a specific version | 8081 |
| /models/{model_name}/{version} | DELETE | Unregister a specific version | 8081 |
| /models/{model_name}/all | GET | Describe all versions of a model | 8081 |
Python Helper Functions
The `ts.launcher` module provides two convenience functions that wrap HTTP calls to the Management API:

- `register_model(model_name, url)`: registers a model with 1 initial worker, synchronously.
- `register_model_with_params(params)`: registers a model with arbitrary query parameters.
Usage
```python
from ts.launcher import register_model, register_model_with_params
```
Alternatively, call the REST endpoints directly with `curl` or the Python `requests` library.
Code Reference
Source Location
| File | Lines | Description | Repository |
|------|-------|-------------|------------|
| docs/management_api.md | L27-188 | REST API documentation | pytorch/serve |
| ts/launcher.py | L107-119 | Python helper functions | pytorch/serve |
Signature
REST Endpoints
POST /models
Parameters:

- `url` (str): Required. Model archive URL or local `.mar` filename.
- `model_name` (str): Optional. Override model name from manifest.
- `handler` (str): Optional. Override handler from manifest.
- `runtime` (str): Optional. Runtime type. Default: `"PYTHON"`.
- `batch_size` (int): Optional. Inference batch size. Default: 1.
- `max_batch_delay` (int): Optional. Max batch wait in ms. Default: 100.
- `initial_workers` (int): Optional. Initial worker count. Default: 0.
- `synchronous` (bool): Optional. Wait for workers. Default: false.
- `response_timeout` (int): Optional. Worker response timeout in seconds. Default: 120.
- `startup_timeout` (int): Optional. Model load timeout in seconds. Default: 120.
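These parameters are sent as a URL query string. A minimal sketch of how a registration request URL is assembled, using only the standard library; the parameter values below are illustrative, not TorchServe defaults:

```python
from urllib.parse import urlencode

# Hypothetical registration parameters for POST /models, per the list above.
params = {
    "url": "https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar",
    "initial_workers": 2,
    "batch_size": 8,
    "max_batch_delay": 50,
    "synchronous": "true",
}

# urlencode percent-encodes values (including the .mar URL) into a query string.
query = urlencode(params)
request_url = f"http://localhost:8081/models?{query}"
print(request_url)
```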
PUT /models/{model_name}
Parameters:

- `min_worker` (int): Optional. Minimum workers. Default: 1.
- `max_worker` (int): Optional. Maximum workers. Default: same as `min_worker`.
- `synchronous` (bool): Optional. Wait for scaling. Default: false.
- `timeout` (int): Optional. Worker drain timeout in seconds. Default: -1.
DELETE /models/{model_name}
GET /models/{model_name}
GET /models
Python Helpers
```python
def register_model(model_name: str, url: str) -> requests.Response:
    """
    Register a model with 1 initial worker, synchronously.

    Sends POST to http://localhost:8081/models with params:
    model_name, url, initial_workers=1, synchronous=true

    Args:
        model_name (str): Name for the model.
        url (str): URL or local filename of the .mar archive.

    Returns:
        requests.Response: HTTP response from the Management API.
    """
    ...


def register_model_with_params(params) -> requests.Response:
    """
    Register a model with arbitrary parameters.

    Sends POST to http://localhost:8081/models with the given params.

    Args:
        params: dict or list of tuples of query parameters.

    Returns:
        requests.Response: HTTP response from the Management API.
    """
    ...
```
Import
```python
from ts.launcher import register_model, register_model_with_params
```
I/O Contract
POST /models (Register)
| Input | Type | Description |
|-------|------|-------------|
| Query parameters | See signature above | Model registration parameters |

| Response Code | Condition | Body |
|---------------|-----------|------|
| 200 | Synchronous registration success | `{"status": "Model \"{name}\" Version: {ver} registered with {n} initial workers"}` |
| 202 | Asynchronous registration accepted | `{"status": "Processing worker updates..."}` |
| 400 | Invalid parameters | Error message |
| 409 | Model already registered | Error message |
PUT /models/{model_name} (Scale)
| Response Code | Condition | Body |
|---------------|-----------|------|
| 200 | Synchronous scaling success | `{"status": "Workers scaled to {n} for model: {name}"}` |
| 202 | Asynchronous scaling accepted | `{"status": "Processing worker updates..."}` |
| 404 | Model not found | Error message |
DELETE /models/{model_name} (Unregister)
| Response Code | Condition | Body |
|---------------|-----------|------|
| 200 | Model unregistered | `{"status": "Model \"{name}\" unregistered"}` |
| 404 | Model not found | Error message |
GET /models/{model_name} (Describe)
| Response Code | Body Fields |
|---------------|-------------|
| 200 | `modelName`, `modelVersion`, `modelUrl`, `runtime`, `minWorkers`, `maxWorkers`, `batchSize`, `maxBatchDelay`, `workers[]` (`id`, `startTime`, `status`, `gpu`, `memoryUsage`), `jobQueueStatus` (`remainingCapacity`, `pendingRequests`) |
register_model()
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| model_name | str | Yes | Name for the model |
| url | str | Yes | URL or local `.mar` filename |

| Return | Type | Description |
|--------|------|-------------|
| response | requests.Response | HTTP response from the Management API |
Usage Examples
Example 1: Register a model from a remote URL
```bash
curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"
```

Response (202):

```json
{
  "status": "Model \"squeezenet_v1.1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
}
```
Example 2: Register with workers synchronously
```bash
curl -v -X POST "http://localhost:8081/models?initial_workers=1&synchronous=true&url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"
```

Response (200):

```json
{
  "status": "Model \"squeezenet1_1\" Version: 1.0 registered with 1 initial workers"
}
```
Example 3: Scale workers
```bash
# Scale to 3 workers synchronously
curl -v -X PUT "http://localhost:8081/models/noop?min_worker=3&synchronous=true"
# Response (200): {"status": "Workers scaled to 3 for model: noop"}

# Scale a specific version
curl -v -X PUT "http://localhost:8081/models/noop/2.0?min_worker=3&synchronous=true"
# Response (200): {"status": "Workers scaled to 3 for model: noop, version: 2.0"}
```
Example 4: Describe model status
```bash
curl http://localhost:8081/models/noop
```

Response (200):

```json
[
  {
    "modelName": "noop",
    "modelVersion": "1.0",
    "modelUrl": "noop.mar",
    "engine": "Torch",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "workers": [
      {
        "id": "9000",
        "startTime": "2018-10-02T13:44:53.034Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 89247744
      }
    ],
    "jobQueueStatus": {
      "remainingCapacity": 100,
      "pendingRequests": 0
    }
  }
]
```
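The describe payload lends itself to simple health checks. A sketch (the helper name is illustrative, not part of TorchServe) that counts READY workers across all version entries, shown against a trimmed copy of the sample response above:

```python
def ready_workers(describe_payload):
    """Count READY workers across all version entries in a
    GET /models/{model_name} response (a JSON array)."""
    return sum(
        1
        for version in describe_payload
        for worker in version.get("workers", [])
        if worker.get("status") == "READY"
    )


# Trimmed version of the sample describe response above.
sample = [
    {
        "modelName": "noop",
        "modelVersion": "1.0",
        "workers": [{"id": "9000", "status": "READY", "gpu": False}],
        "jobQueueStatus": {"remainingCapacity": 100, "pendingRequests": 0},
    }
]

print(ready_workers(sample))  # 1
```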
Example 5: Python programmatic registration
```python
from ts.launcher import register_model, register_model_with_params

# Simple registration: 1 worker, synchronous
response = register_model("squeezenet", "squeezenet1_1.mar")
print(response.status_code)  # 200
print(response.json())       # {"status": "Model ..."}

# Registration with custom parameters
params = {
    "model_name": "bert",
    "url": "bert.mar",
    "initial_workers": "4",
    "batch_size": "16",
    "max_batch_delay": "200",
    "synchronous": "true",
    "response_timeout": "300",
}
response = register_model_with_params(params)
print(response.status_code)  # 200
```
Example 6: Full lifecycle workflow
```python
import requests

BASE = "http://localhost:8081"

# 1. Register model
requests.post(f"{BASE}/models", params={
    "url": "resnet18.mar",
    "initial_workers": "2",
    "synchronous": "true",
})

# 2. Check status
status = requests.get(f"{BASE}/models/resnet18").json()
print(f"Workers: {len(status[0]['workers'])}")

# 3. Scale up for peak traffic
requests.put(f"{BASE}/models/resnet18", params={
    "min_worker": "8",
    "synchronous": "true",
})

# 4. Scale down after peak
requests.put(f"{BASE}/models/resnet18", params={
    "min_worker": "2",
    "synchronous": "true",
})

# 5. Unregister when no longer needed
requests.delete(f"{BASE}/models/resnet18")
```
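Asynchronous calls (`synchronous=false`) return 202 before workers are actually up, so a workflow like the one above often needs to poll the describe endpoint until workers report READY. A hedged sketch under that assumption; the helper names and polling intervals are illustrative, not TorchServe APIs:

```python
import time


def workers_ready(status, minimum=1):
    """True if the first version entry in a GET /models/{name} response
    reports at least `minimum` READY workers. Pure function: testable
    without a running server."""
    workers = status[0].get("workers", []) if status else []
    return sum(w.get("status") == "READY" for w in workers) >= minimum


def wait_for_model(name, minimum=1, timeout_s=120, interval_s=2.0,
                   base="http://localhost:8081"):
    """Poll GET /models/{name} until enough workers are READY or the
    timeout expires. Returns True on success, False on timeout."""
    import requests  # third-party; imported lazily so the pure helper above has no dependency

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = requests.get(f"{base}/models/{name}")
        if resp.status_code == 200 and workers_ready(resp.json(), minimum):
            return True
        time.sleep(interval_s)
    return False


# The readiness check can be exercised on a canned payload:
print(workers_ready([{"workers": [{"status": "READY"}]}]))  # True
```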
Related Pages