Principle:Pytorch Serve Model Registration
Overview
Model Registration is the principle of dynamic model lifecycle management through a REST API in TorchServe. It enables registering models on a running server, scaling worker processes up or down, querying model status, and unregistering models -- all without restarting the server. This dynamic management capability is essential for production environments where models must be updated, scaled, and retired in response to changing traffic patterns and business requirements.
| Field | Value |
|---|---|
| Principle Name | Model Registration |
| Workflow | Model_Deployment |
| Domains | Model_Serving, API_Design |
| Knowledge Sources | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Description
TorchServe's Management API provides a complete model lifecycle management interface on port 8081 (by default). The API follows RESTful conventions with resources mapped to models and standard HTTP verbs mapped to lifecycle operations.
Lifecycle Operations
| Operation | HTTP Method | Endpoint | Description |
|---|---|---|---|
| Register | POST | `/models` | Load a model archive and create initial workers |
| Describe | GET | `/models/{model_name}` | Query model runtime status, worker count, and queue depth |
| Scale | PUT | `/models/{model_name}` | Adjust minimum and maximum worker count |
| Unregister | DELETE | `/models/{model_name}` | Remove a model and terminate its workers |
| List | GET | `/models` | List all registered models |
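The table above can be sketched as a small helper that maps each lifecycle operation to the HTTP request it issues. This is an illustrative sketch, not part of TorchServe itself; `MANAGEMENT` assumes the default management address and `lifecycle_request` is a hypothetical name:

```python
from urllib.parse import urlencode

MANAGEMENT = "http://localhost:8081"  # assumed default management address

def lifecycle_request(operation, model_name=None, **params):
    """Map a lifecycle operation to its (HTTP method, URL) pair."""
    ops = {
        "register":   ("POST",   "/models"),
        "list":       ("GET",    "/models"),
        "describe":   ("GET",    f"/models/{model_name}"),
        "scale":      ("PUT",    f"/models/{model_name}"),
        "unregister": ("DELETE", f"/models/{model_name}"),
    }
    method, path = ops[operation]
    query = f"?{urlencode(params)}" if params else ""
    return method, f"{MANAGEMENT}{path}{query}"

# Example: build the scale call shown later under Usage
print(lifecycle_request("scale", "squeezenet", min_worker=3))
```

Any HTTP client can then send the resulting request; the pairs line up one-to-one with the `curl` commands in the Usage section.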
Registration Parameters
When registering a model, the following parameters control its serving behavior:
| Parameter | Default | Description |
|---|---|---|
| `url` | (required) | Path or URL to the `.mar` file |
| `model_name` | From manifest | Name used in inference endpoint URLs |
| `initial_workers` | 0 | Number of workers to create immediately |
| `batch_size` | 1 | Inference batch size |
| `max_batch_delay` | 100 ms | Maximum wait time for batch aggregation |
| `response_timeout` | 120 s | Worker response timeout before the worker is restarted |
| `startup_timeout` | 120 s | Model load timeout before the worker is restarted |
| `synchronous` | false | Whether the call blocks until workers are ready |
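As a minimal sketch, the query string for a registration call can be assembled with the standard library. `registration_query` is a hypothetical helper: `url` is the only required parameter, and any omitted option simply falls back to the server-side defaults listed in the table:

```python
from urllib.parse import urlencode

def registration_query(url, **options):
    """Build the query string for POST /models.

    Only `url` is required; anything not passed here keeps the
    server-side default (initial_workers=0, batch_size=1, ...).
    """
    params = {"url": url, **options}
    return urlencode(params)

q = registration_query("squeezenet1_1.mar", initial_workers=2,
                       batch_size=8, max_batch_delay=50,
                       synchronous="true")
print(q)
```

The resulting string is exactly what appears after `?` in the `curl -X POST` examples under Usage.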
Worker Scaling
Worker scaling is the mechanism for adjusting the number of backend processes serving a model. Each worker:
- Loads its own copy of the model into memory.
- Handles requests independently (no shared state).
- Can be pinned to a specific GPU via `gpu_id`.
Scaling decisions are based on:
- Throughput requirements: More workers serve more concurrent requests.
- Latency targets: Worker count affects queue depth and batch wait times.
- Resource constraints: Each worker consumes memory (CPU/GPU RAM).
- GPU utilization: Workers can be distributed across GPUs.
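A deliberately rough back-of-envelope sketch of these trade-offs, assuming you have measured per-worker throughput and memory footprint (this heuristic is illustrative only, not TorchServe logic):

```python
import math

def suggest_workers(target_rps, per_worker_rps, free_mem_mb, worker_mem_mb):
    """Estimate a worker count: enough for throughput, capped by memory."""
    needed = math.ceil(target_rps / per_worker_rps)   # throughput requirement
    mem_cap = free_mem_mb // worker_mem_mb            # resource constraint
    return max(1, min(needed, mem_cap))

# 120 req/s target, ~35 req/s per worker, 16 GB free, ~2.5 GB per worker
print(suggest_workers(120, 35, 16_000, 2_500))  # → 4
```

The result would then be applied with the Scale operation (`PUT /models/{model_name}?min_worker=...`); latency targets may push the number higher than the pure throughput estimate.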
Synchronous vs Asynchronous Operations
Both registration and scaling support synchronous and asynchronous modes:
- Asynchronous (default): Returns HTTP 202 immediately; workers are created in the background. Suitable for automation scripts that poll for readiness.
- Synchronous: Blocks until all workers are online and returns HTTP 200. Suitable for integration tests and deployment scripts that need to proceed only when the model is ready.
Model Versioning
TorchServe supports multiple versions of the same model simultaneously. Scaling and unregistration can target specific versions:
- `PUT /models/{model_name}/{version}` -- Scale a specific version.
- `DELETE /models/{model_name}/{version}` -- Unregister a specific version.
- `GET /models/{model_name}/all` -- Describe all versions.
Usage
Register a Model
```shell
# Register from a local .mar file in the model store
curl -X POST "http://localhost:8081/models?url=squeezenet1_1.mar&initial_workers=1&synchronous=true"

# Register from a remote URL
curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"
```
Scale Workers
```shell
# Scale to 3 workers synchronously
curl -X PUT "http://localhost:8081/models/squeezenet?min_worker=3&synchronous=true"

# Scale a specific version
curl -X PUT "http://localhost:8081/models/squeezenet/2.0?min_worker=5&synchronous=true"
```
Describe a Model
```shell
# Get model runtime status
curl "http://localhost:8081/models/squeezenet"

# Response includes:
# - modelName, modelVersion, modelUrl
# - minWorkers, maxWorkers, batchSize, maxBatchDelay
# - workers: [{id, startTime, status, gpu, memoryUsage}]
# - jobQueueStatus: {remainingCapacity, pendingRequests}
```
Unregister a Model
```shell
# Unregister a model and terminate all its workers
curl -X DELETE "http://localhost:8081/models/squeezenet"
```
Programmatic Registration
```python
from ts.launcher import register_model, register_model_with_params

# Simple registration
response = register_model("squeezenet", "squeezenet1_1.mar")

# Registration with custom parameters
params = {
    "model_name": "bert",
    "url": "bert.mar",
    "initial_workers": "4",
    "batch_size": "16",
    "max_batch_delay": "200",
    "synchronous": "true",
}
response = register_model_with_params(params)
```
Theoretical Basis
Dynamic Service Discovery
Model registration is a form of Dynamic Service Discovery, where serving endpoints are created and removed at runtime. Each registered model creates a new inference endpoint at /predictions/{model_name}, analogous to dynamic route registration in microservice architectures.
Resource Pool Pattern
Worker scaling implements the Resource Pool (or Object Pool) pattern. Workers are pre-allocated resources that are pooled and reused across requests. Scaling the pool size up or down adjusts the system's capacity without restarting the server or affecting other models.
Control Plane / Data Plane Separation
The Management API (port 8081) serves as the Control Plane while the Inference API (port 8080) is the Data Plane. This separation allows:
- Different access control policies for management vs. inference.
- Independent scaling and rate limiting for each plane.
- Network isolation: management endpoints can be restricted to internal networks while inference endpoints face the public internet.
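One way this separation is typically enforced is in `config.properties`, using TorchServe's standard address bindings (the exact interfaces below are illustrative):

```properties
# config.properties: bind the two planes to different interfaces
# Data plane: reachable from outside
inference_address=http://0.0.0.0:8080
# Control plane: loopback only, so management calls stay internal
management_address=http://127.0.0.1:8081
```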
Eventual Consistency
Asynchronous operations follow the Eventual Consistency model. After an asynchronous registration or scaling request, the system is in a transitional state. Clients must poll the Describe API or wait for the synchronous variant to ensure the desired state has been reached.
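A minimal polling sketch for this transitional state; `wait_until_ready` is a hypothetical helper, and the `describe` callable is injected (it stands in for any client that performs `GET /models/{model_name}` and parses the JSON), so the logic is independent of the HTTP library used:

```python
import time

def wait_until_ready(describe, model_name, timeout_s=60, interval_s=1.0):
    """Poll a Describe callable until every worker reports READY.

    `describe(model_name)` must return the parsed Describe response:
    a list whose first element is a dict with a "workers" list.
    Returns True once all workers are READY, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        workers = describe(model_name)[0]["workers"]
        if workers and all(w["status"] == "READY" for w in workers):
            return True
        time.sleep(interval_s)
    return False

# Test stub standing in for GET /models/{model_name}
stub = lambda name: [{"modelName": name,
                      "workers": [{"id": "9000", "status": "READY"}]}]
print(wait_until_ready(stub, "squeezenet", timeout_s=5))  # → True
```

Using `synchronous=true` on the original request achieves the same guarantee server-side and avoids the polling loop entirely.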
Related Pages
- Implementation:Pytorch_Serve_Management_API - The Management API endpoints and Python helper functions
- Principle:Pytorch_Serve_Server_Lifecycle - Server must be running before models can be registered
- Principle:Pytorch_Serve_Model_Archiving - Models are registered from `.mar` archives
- Principle:Pytorch_Serve_Inference_Pipeline - Registered models become available for inference