Implementation:Pytorch Serve Metrics API
| Field | Value |
|---|---|
| Page Type | Implementation |
| Implementation Type | External Tool Doc |
| Domains | Monitoring, Infrastructure |
| Knowledge Sources | TorchServe |
| Workflow | LLM_Deployment_vLLM |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
The TorchServe Metrics API is an HTTP endpoint that exposes system, model, and counter metrics in Prometheus exposition format. By default it listens on port 8082 and accepts connections from localhost only. It returns Prometheus-format metrics when the metrics_mode configuration is set to prometheus. This API enables integration with Prometheus servers and Grafana dashboards for production monitoring of LLM deployments.
Description
The Metrics API provides a single HTTP GET endpoint that returns all collected metrics in Prometheus text format. The metrics include host-level system resource utilization, per-model inference performance measurements, and aggregate request counters. The endpoint also accepts a filter so that only the metrics with the requested names are returned.
Key characteristics:
- Default port: 8082
- Default access: localhost only (configurable via `config.properties`)
- Format: Prometheus exposition format (text/plain)
- Enabled by default: yes (disable with `enable_metrics_api=false`)
- Metrics mode: must be set to `prometheus` for Prometheus-format output
Usage
Querying Metrics
# Retrieve all metrics
curl http://127.0.0.1:8082/metrics
# Retrieve specific metrics by name
curl "http://127.0.0.1:8082/metrics?name[]=ts_inference_latency_microseconds&name[]=ts_queue_latency_microseconds" --globoff
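The filtered query above repeats the name[] parameter once per metric. As a minimal sketch of how a client could assemble that URL, the helper below (metrics_url is a hypothetical name, not part of TorchServe) builds the same query string with the standard library and sends nothing over the network:

```python
from urllib.parse import urlencode


def metrics_url(names=(), base="http://127.0.0.1:8082/metrics"):
    """Build a Metrics API URL, optionally filtered by metric name."""
    if not names:
        return base
    # Each requested metric is sent as a repeated name[] parameter.
    # safe="[]" keeps the brackets literal, matching the curl example.
    query = urlencode([("name[]", n) for n in names], safe="[]")
    return f"{base}?{query}"


url = metrics_url(["ts_inference_latency_microseconds",
                   "ts_queue_latency_microseconds"])
print(url)
```

The resulting string can be passed to any HTTP client once TorchServe is running with the metrics endpoint enabled.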
Scaling Models via Management API
# Scale model workers based on observed metrics
curl -X PUT "http://localhost:8081/models/{model_name}?min_worker=2&max_worker=4"
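The same scaling call can be prepared from Python. This is a sketch only: scale_request is a hypothetical helper, it passes min_worker and max_worker as query parameters (the form used in the TorchServe management API documentation), and it builds the request without sending it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def scale_request(model_name, min_worker, max_worker,
                  base="http://localhost:8081"):
    """Build (but do not send) a PUT request that scales a model's workers."""
    if min_worker > max_worker:
        raise ValueError("min_worker must not exceed max_worker")
    query = urlencode({"min_worker": min_worker, "max_worker": max_worker})
    url = f"{base}/models/{quote(model_name, safe='')}?{query}"
    return Request(url, method="PUT")


req = scale_request("resnet18", 2, 4)
print(req.get_method(), req.full_url)
# Send it with urllib.request.urlopen(req) once TorchServe is running.
```

Keeping construction separate from sending makes it easy to log or review the scaling decision before it reaches the management port.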
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| `docs/metrics_api.md` | L1-118 | Metrics API documentation and Prometheus integration guide |
Signature
GET http://localhost:8082/metrics
Returns: Prometheus-formatted metrics (text/plain)
GET http://localhost:8082/metrics?name[]=<metric_name>&name[]=<metric_name>
Returns: Filtered Prometheus-formatted metrics for specified names
PUT http://localhost:8081/models/{model_name}
Parameters: min_worker (int), max_worker (int)
Returns: Model scaling confirmation
Import
# The Metrics API is a built-in TorchServe HTTP endpoint.
# No Python import is required -- it is enabled via configuration:
# In config.properties:
metrics_mode=prometheus
enable_metrics_api=true
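# 0.0.0.0 exposes the endpoint on all interfaces; the default is http://127.0.0.1:8082 (localhost only)
metrics_address=http://0.0.0.0:8082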
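A quick sanity check of these settings can be scripted before starting the server. The sketch below is an illustration, not TorchServe code: parse_properties and prometheus_endpoint_enabled are hypothetical helpers, the parser handles only simple key=value lines, and it follows this page in treating the API as enabled by default and metrics_mode as something that must be set to prometheus explicitly.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def prometheus_endpoint_enabled(props):
    """True when the settings described on this page are both satisfied."""
    api_on = props.get("enable_metrics_api", "true").lower() == "true"
    prom_mode = props.get("metrics_mode", "").lower() == "prometheus"
    return api_on and prom_mode


sample = """
# In config.properties:
metrics_mode=prometheus
enable_metrics_api=true
metrics_address=http://0.0.0.0:8082
"""
props = parse_properties(sample)
print(prometheus_endpoint_enabled(props), props["metrics_address"])
```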
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | HTTP GET | GET /metrics with optional name[] query parameters |
| Output | text/plain | Prometheus exposition format with TYPE, HELP, and metric lines |
| Configuration | config.properties | metrics_mode=prometheus, enable_metrics_api=true, metrics_address |
| Port | int | Default 8082 |
| Access | Network | localhost only by default; configurable via metrics_address |
System Metrics
| Metric Name | Type | Unit | Description |
|---|---|---|---|
| `CPUUtilization` | gauge | Percent | Host CPU utilization |
| `MemoryUsed` | gauge | Megabytes | Host RAM in use |
| `MemoryAvailable` | gauge | Megabytes | Host RAM available |
| `MemoryUtilization` | gauge | Percent | Host RAM usage percentage |
| `DiskUsage` | gauge | Gigabytes | Disk space used |
| `DiskAvailable` | gauge | Gigabytes | Disk space available |
| `DiskUtilization` | gauge | Percent | Disk usage percentage |
| `GPUUtilization` | gauge | Percent | GPU compute utilization |
| `GPUMemoryUtilization` | gauge | Percent | GPU memory utilization |
| `GPUMemoryUsed` | gauge | Megabytes | GPU memory in use |
Model Metrics
| Metric Name | Type | Unit | Description |
|---|---|---|---|
| `HandlerTime` | gauge | ms | Total handler execution time (preprocess + inference + postprocess) |
| `PredictionTime` | gauge | ms | Inference-only execution time |
| `QueueTime` | gauge | Milliseconds | Time request waited in queue |
| `WorkerLoadTime` | gauge | Milliseconds | Time to load model into worker |
| `WorkerThreadTime` | gauge | Milliseconds | Time spent in worker thread |
Counter Metrics
| Metric Name | Type | Unit | Description |
|---|---|---|---|
| `Requests2XX` | counter | Count | Successful requests |
| `Requests4XX` | counter | Count | Client error requests |
| `Requests5XX` | counter | Count | Server error requests |
| `ts_inference_requests_total` | counter | Count | Total inference requests per model/version |
| `ts_inference_latency_microseconds` | counter | Microseconds | Cumulative inference latency |
| `ts_queue_latency_microseconds` | counter | Microseconds | Cumulative queue latency |
Usage Examples
Example 1: Full Metrics Output
curl http://127.0.0.1:8082/metrics
Returns output in the following format:
# HELP Requests5XX Torchserve prometheus counter metric with unit: Count
# TYPE Requests5XX counter
# HELP DiskUsage Torchserve prometheus gauge metric with unit: Gigabytes
# TYPE DiskUsage gauge
DiskUsage{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 20.054508209228516
# HELP GPUUtilization Torchserve prometheus gauge metric with unit: Percent
# TYPE GPUUtilization gauge
# HELP PredictionTime Torchserve prometheus gauge metric with unit: ms
# TYPE PredictionTime gauge
PredictionTime{ModelName="resnet18",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 83.13
# HELP MemoryAvailable Torchserve prometheus gauge metric with unit: Megabytes
# TYPE MemoryAvailable gauge
MemoryAvailable{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 5829.7421875
# HELP ts_inference_requests_total Torchserve prometheus counter metric with unit: Count
# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 3.0
# HELP HandlerTime Torchserve prometheus gauge metric with unit: ms
# TYPE HandlerTime gauge
HandlerTime{ModelName="resnet18",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 82.93
# HELP ts_inference_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds
# TYPE ts_inference_latency_microseconds counter
ts_inference_latency_microseconds{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 290371.129
# HELP CPUUtilization Torchserve prometheus gauge metric with unit: Percent
# TYPE CPUUtilization gauge
CPUUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 0.0
# HELP MemoryUsed Torchserve prometheus gauge metric with unit: Megabytes
# TYPE MemoryUsed gauge
MemoryUsed{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8245.62109375
# HELP QueueTime Torchserve prometheus gauge metric with unit: Milliseconds
# TYPE QueueTime gauge
QueueTime{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 0.0
# HELP Requests2XX Torchserve prometheus counter metric with unit: Count
# TYPE Requests2XX counter
Requests2XX{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8.0
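For ad hoc scripts that read this output without a Prometheus server, the sample lines can be parsed directly. The following is a minimal sketch under stated assumptions: parse_metrics is a hypothetical helper, it covers only plain sample lines of the shape shown above (including the trailing comma inside the braces), and it is not a full implementation of the exposition format.

```python
import re

# name{labels} value -- labels are optional; the sample output above leaves
# a trailing comma inside the braces, which this pattern tolerates.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a plain sample line
        name, raw_labels, value = match.groups()
        try:
            number = float(value)
        except ValueError:
            continue  # value was not numeric
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, number))
    return samples


sample = '''# HELP PredictionTime Torchserve prometheus gauge metric with unit: ms
# TYPE PredictionTime gauge
PredictionTime{ModelName="resnet18",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 83.13
'''
for name, labels, value in parse_metrics(sample):
    print(name, labels["ModelName"], value)
```

For anything beyond quick inspection, a maintained Prometheus client library is likely a safer choice than a hand-written pattern.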
Example 2: Filtered Metrics Query
curl "http://127.0.0.1:8082/metrics?name[]=ts_inference_latency_microseconds&name[]=ts_queue_latency_microseconds" --globoff
Returns only the specified metrics:
# HELP ts_queue_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds
# TYPE ts_queue_latency_microseconds counter
ts_queue_latency_microseconds{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 365.21
# HELP ts_inference_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds
# TYPE ts_inference_latency_microseconds counter
ts_inference_latency_microseconds{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 290371.129
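Because both latency metrics are cumulative counters, a single reading is less informative than the change between two scrapes. As a sketch, assuming the latency counter accumulates per-request inference time in microseconds, the average latency over an interval is the latency delta divided by the request-count delta. The function name and the zero-valued first scrape below are hypothetical.

```python
def mean_latency_ms(latency_us_before, latency_us_after,
                    requests_before, requests_after):
    """Average per-request latency (ms) between two scrapes, or None."""
    served = requests_after - requests_before
    if served <= 0:
        return None  # no new requests, or the counters were reset
    delta_us = latency_us_after - latency_us_before
    return delta_us / served / 1000.0


# Using the sample values on this page as the second scrape and
# zero as a hypothetical first scrape taken right after startup:
avg = mean_latency_ms(0.0, 290371.129, 0.0, 3.0)
print(f"{avg:.2f} ms per request")
```

Returning None when no requests were served avoids a division by zero and signals that the interval carries no latency information.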
Example 3: Prometheus Server Configuration
Create a prometheus.yml configuration file to scrape TorchServe metrics:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'torchserve'
    static_configs:
      - targets: ['localhost:8082'] # TorchServe metrics endpoint
Start Prometheus with this configuration:
./prometheus --config.file=prometheus.yml
Navigate to http://localhost:9090/ to execute queries and create graphs.
Example 4: Grafana Integration
After configuring Prometheus, set up Grafana for dashboard visualization:
# Start Grafana
sudo systemctl daemon-reload && sudo systemctl enable grafana-server && sudo systemctl start grafana-server
Navigate to http://localhost:3000/, add the Prometheus data source pointing to http://localhost:9090, and build dashboards for:
- GPU Memory Dashboard -- track `GPUMemoryUtilization` and `GPUMemoryUsed` over time
- Latency Dashboard -- plot `HandlerTime`, `PredictionTime`, and `QueueTime` as time series
- Throughput Dashboard -- rate of `ts_inference_requests_total` per model
- Error Rate Dashboard -- ratio of `Requests5XX` to total requests
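The error-rate ratio can be checked offline before building the panel. This sketch assumes that "total requests" means the sum of the three status counters listed on this page; error_ratio is a hypothetical helper and the inputs are counter values (or their deltas over an interval).

```python
def error_ratio(requests_2xx, requests_4xx, requests_5xx):
    """Share of requests that ended in a 5XX response, in the range 0..1."""
    total = requests_2xx + requests_4xx + requests_5xx
    if total == 0:
        return 0.0  # nothing served yet, so report no errors
    return requests_5xx / total


# Requests2XX is 8.0 in the sample output; the other two values are
# hypothetical because the sample shows no data lines for them.
print(error_ratio(requests_2xx=8.0, requests_4xx=0.0, requests_5xx=0.0))
```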
Related Pages
- Principle:Pytorch_Serve_Metrics_Monitoring -- the theoretical basis for production observability in model serving systems