Principle:Pytorch Serve Model Registration
Overview
Model Registration is the principle of dynamic model lifecycle management through a REST API in TorchServe. It enables registering models on a running server, scaling worker processes up or down, querying model status, and unregistering models -- all without restarting the server. This dynamic management capability is essential for production environments where models must be updated, scaled, and retired in response to changing traffic patterns and business requirements.
| Field | Value |
|---|---|
| Principle Name | Model Registration |
| Workflow | Model_Deployment |
| Domains | Model_Serving, API_Design |
| Knowledge Sources | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Description
TorchServe's Management API provides a complete model lifecycle management interface on port 8081 (by default). The API follows RESTful conventions with resources mapped to models and standard HTTP verbs mapped to lifecycle operations.
Lifecycle Operations
| Operation | HTTP Method | Endpoint | Description |
|---|---|---|---|
| Register | POST | `/models` | Load a model archive and create initial workers |
| Describe | GET | `/models/{model_name}` | Query model runtime status, worker count, and queue depth |
| Scale | PUT | `/models/{model_name}` | Adjust minimum and maximum worker count |
| Unregister | DELETE | `/models/{model_name}` | Remove a model and terminate its workers |
| List | GET | `/models` | List all registered models |
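The table above can be sketched as a small helper that maps each lifecycle operation to the HTTP request it issues. This is an illustrative sketch, not part of TorchServe itself; `MANAGEMENT` assumes the default management address and `lifecycle_request` is a hypothetical name:

```python
from urllib.parse import urlencode

MANAGEMENT = "http://localhost:8081"  # assumed default management address

def lifecycle_request(operation, model_name=None, **params):
    """Map a lifecycle operation to its (HTTP method, URL) pair."""
    ops = {
        "register":   ("POST",   "/models"),
        "list":       ("GET",    "/models"),
        "describe":   ("GET",    f"/models/{model_name}"),
        "scale":      ("PUT",    f"/models/{model_name}"),
        "unregister": ("DELETE", f"/models/{model_name}"),
    }
    method, path = ops[operation]
    query = f"?{urlencode(params)}" if params else ""
    return method, f"{MANAGEMENT}{path}{query}"

# Example: build the scale call shown later under Usage
print(lifecycle_request("scale", "squeezenet", min_worker=3))
```

Any HTTP client can then send the resulting request; the pairs line up one-to-one with the `curl` commands in the Usage section.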
Registration Parameters
When registering a model, the following parameters control its serving behavior:
| Parameter | Default | Description |
|---|---|---|
| `url` | (required) | Path or URL to the `.mar` file |
| `model_name` | From manifest | Name used in inference endpoint URLs |
| `initial_workers` | 0 | Number of workers to create immediately |
| `batch_size` | 1 | Inference batch size |
| `max_batch_delay` | 100 ms | Maximum wait time for batch aggregation |
| `response_timeout` | 120 s | Worker response timeout before the worker is restarted |
| `startup_timeout` | 120 s | Model load timeout before the worker is restarted |
| `synchronous` | false | Whether the call blocks until workers are ready |
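As a minimal sketch, the query string for a registration call can be assembled with the standard library. `registration_query` is a hypothetical helper: `url` is the only required parameter, and any omitted option simply falls back to the server-side defaults listed in the table:

```python
from urllib.parse import urlencode

def registration_query(url, **options):
    """Build the query string for POST /models.

    Only `url` is required; anything not passed here keeps the
    server-side default (initial_workers=0, batch_size=1, ...).
    """
    params = {"url": url, **options}
    return urlencode(params)

q = registration_query("squeezenet1_1.mar", initial_workers=2,
                       batch_size=8, max_batch_delay=50,
                       synchronous="true")
print(q)
```

The resulting string is exactly what appears after `?` in the `curl -X POST` examples under Usage.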
Worker Scaling
Worker scaling is the mechanism for adjusting the number of backend processes serving a model. Each worker:
- Loads its own copy of the model into memory.
- Handles requests independently (no shared state).
- Can be pinned to a specific GPU via `gpu_id`.
Scaling decisions are based on:
- Throughput requirements: More workers serve more concurrent requests.
- Latency targets: Worker count affects queue depth and batch wait times.
- Resource constraints: Each worker consumes memory (CPU/GPU RAM).
- GPU utilization: Workers can be distributed across GPUs.
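A deliberately rough back-of-envelope sketch of these trade-offs, assuming you have measured per-worker throughput and memory footprint (this heuristic is illustrative only, not TorchServe logic):

```python
import math

def suggest_workers(target_rps, per_worker_rps, free_mem_mb, worker_mem_mb):
    """Estimate a worker count: enough for throughput, capped by memory."""
    needed = math.ceil(target_rps / per_worker_rps)   # throughput requirement
    mem_cap = free_mem_mb // worker_mem_mb            # resource constraint
    return max(1, min(needed, mem_cap))

# 120 req/s target, ~35 req/s per worker, 16 GB free, ~2.5 GB per worker
print(suggest_workers(120, 35, 16_000, 2_500))  # → 4
```

The result would then be applied with the Scale operation (`PUT /models/{model_name}?min_worker=...`); latency targets may push the number higher than the pure throughput estimate.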
Synchronous vs Asynchronous Operations
Both registration and scaling support synchronous and asynchronous modes:
- Asynchronous (default): Returns HTTP 202 immediately; workers are created in the background. Suitable for automation scripts that poll for readiness.
- Synchronous: Blocks until all workers are online and returns HTTP 200. Suitable for integration tests and deployment scripts that need to proceed only when the model is ready.
Model Versioning
TorchServe supports multiple versions of the same model simultaneously. Scaling and unregistration can target specific versions:
- `PUT /models/{model_name}/{version}` -- Scale a specific version.
- `DELETE /models/{model_name}/{version}` -- Unregister a specific version.
- `GET /models/{model_name}/all` -- Describe all versions.
Usage
Register a Model
```shell
# Register from a local .mar file in the model store
curl -X POST "http://localhost:8081/models?url=squeezenet1_1.mar&initial_workers=1&synchronous=true"

# Register from a remote URL
curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"
```
Scale Workers
```shell
# Scale to 3 workers synchronously
curl -X PUT "http://localhost:8081/models/squeezenet?min_worker=3&synchronous=true"

# Scale a specific version
curl -X PUT "http://localhost:8081/models/squeezenet/2.0?min_worker=5&synchronous=true"
```
Describe a Model
```shell
# Get model runtime status
curl "http://localhost:8081/models/squeezenet"

# Response includes:
# - modelName, modelVersion, modelUrl
# - minWorkers, maxWorkers, batchSize, maxBatchDelay
# - workers: [{id, startTime, status, gpu, memoryUsage}]
# - jobQueueStatus: {remainingCapacity, pendingRequests}
```
Unregister a Model
```shell
# Unregister a model and terminate all its workers
curl -X DELETE "http://localhost:8081/models/squeezenet"
```
Programmatic Registration
```python
from ts.launcher import register_model, register_model_with_params

# Simple registration
response = register_model("squeezenet", "squeezenet1_1.mar")

# Registration with custom parameters
params = {
    "model_name": "bert",
    "url": "bert.mar",
    "initial_workers": "4",
    "batch_size": "16",
    "max_batch_delay": "200",
    "synchronous": "true",
}
response = register_model_with_params(params)
```
Theoretical Basis
Dynamic Service Discovery
Model registration is a form of Dynamic Service Discovery, where serving endpoints are created and removed at runtime. Each registered model creates a new inference endpoint at /predictions/{model_name}, analogous to dynamic route registration in microservice architectures.
Resource Pool Pattern
Worker scaling implements the Resource Pool (or Object Pool) pattern. Workers are pre-allocated resources that are pooled and reused across requests. Scaling the pool size up or down adjusts the system's capacity without restarting the server or affecting other models.
Control Plane / Data Plane Separation
The Management API (port 8081) serves as the Control Plane while the Inference API (port 8080) is the Data Plane. This separation allows:
- Different access control policies for management vs. inference.
- Independent scaling and rate limiting for each plane.
- Network isolation: management endpoints can be restricted to internal networks while inference endpoints face the public internet.
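One way this separation is typically enforced is in `config.properties`, using TorchServe's standard address bindings (the exact interfaces below are illustrative):

```properties
# config.properties: bind the two planes to different interfaces
# Data plane: reachable from outside
inference_address=http://0.0.0.0:8080
# Control plane: loopback only, so management calls stay internal
management_address=http://127.0.0.1:8081
```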
Eventual Consistency
Asynchronous operations follow the Eventual Consistency model. After an asynchronous registration or scaling request, the system is in a transitional state. Clients must poll the Describe API or wait for the synchronous variant to ensure the desired state has been reached.
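A minimal polling sketch for this transitional state; `wait_until_ready` is a hypothetical helper, and the `describe` callable is injected (it stands in for any client that performs `GET /models/{model_name}` and parses the JSON), so the logic is independent of the HTTP library used:

```python
import time

def wait_until_ready(describe, model_name, timeout_s=60, interval_s=1.0):
    """Poll a Describe callable until every worker reports READY.

    `describe(model_name)` must return the parsed Describe response:
    a list whose first element is a dict with a "workers" list.
    Returns True once all workers are READY, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        workers = describe(model_name)[0]["workers"]
        if workers and all(w["status"] == "READY" for w in workers):
            return True
        time.sleep(interval_s)
    return False

# Test stub standing in for GET /models/{model_name}
stub = lambda name: [{"modelName": name,
                      "workers": [{"id": "9000", "status": "READY"}]}]
print(wait_until_ready(stub, "squeezenet", timeout_s=5))  # → True
```

Using `synchronous=true` on the original request achieves the same guarantee server-side and avoids the polling loop entirely.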
Related Pages
- Implementation:Pytorch_Serve_Management_API - The Management API endpoints and Python helper functions
- Principle:Pytorch_Serve_Server_Lifecycle - Server must be running before models can be registered
- Principle:Pytorch_Serve_Model_Archiving - Models are registered from `.mar` archives
- Principle:Pytorch_Serve_Inference_Pipeline - Registered models become available for inference