Implementation:Pytorch Serve Metrics API

Field	Value
Page Type	Implementation
Implementation Type	External Tool Doc
Domains	Monitoring, Infrastructure
Knowledge Sources	TorchServe
Workflow	LLM_Deployment_vLLM
Last Updated	2026-02-13 00:00 GMT

Overview

The TorchServe Metrics API is an HTTP endpoint that exposes system, model, and counter metrics in the Prometheus exposition format. By default it listens on port 8082 and accepts connections from localhost only. It returns metrics only when the metrics_mode configuration is set to prometheus. Prometheus servers can scrape the endpoint, and Grafana dashboards can then visualize the results for production monitoring of LLM deployments.

Description

The Metrics API provides a single HTTP GET endpoint that returns all collected metrics in Prometheus text format. The metrics cover host-level system resource utilization, per-model inference performance, and aggregate request counters. The endpoint also accepts name[] query parameters to return only specific metrics by name.

Key characteristics:

Default port: 8082
Default access: localhost only (configurable via metrics_address in config.properties)
Format: Prometheus exposition format (text/plain)
Enabled by default: yes (disable with enable_metrics_api=false)
Metrics mode: must be set to prometheus for Prometheus-format output

Usage

Querying Metrics

# Retrieve all metrics
curl http://127.0.0.1:8082/metrics

# Retrieve specific metrics by name
curl "http://127.0.0.1:8082/metrics?name[]=ts_inference_latency_microseconds&name[]=ts_queue_latency_microseconds" --globoff
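
The filtered query repeats the name[] key once per metric. As a minimal sketch of how a client might build that URL, the helper below (metrics_url is a hypothetical name, not part of TorchServe) assumes the default address 127.0.0.1:8082 and leaves the brackets unescaped so the result matches the curl form above.

```python
from urllib.parse import urlencode


def metrics_url(names=(), base="http://127.0.0.1:8082"):
    """Build the Metrics API URL, optionally filtered by metric name.

    `base` is an assumption: the default address described on this page.
    """
    if not names:
        return f"{base}/metrics"
    # Repeat the name[] key once per metric; safe="[]" keeps the brackets
    # literal so the URL matches the curl example.
    query = urlencode([("name[]", name) for name in names], safe="[]")
    return f"{base}/metrics?{query}"
```

The resulting string can be passed to any HTTP client, for example urllib.request.urlopen, on a host where TorchServe is running.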

Scaling Models via Management API

# Scale model workers based on observed metrics
# (replace {model_name} with the registered model name; worker counts are passed as query parameters)
curl -X PUT "http://localhost:8081/models/{model_name}?min_worker=2&max_worker=4"
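
The same scaling call can be prepared from a script. The sketch below is an illustration only: scale_request is a hypothetical helper, and it assumes the worker counts are passed as query parameters on the Management API URL at localhost:8081. It builds the request without sending it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def scale_request(model_name, min_worker, max_worker, base="http://localhost:8081"):
    """Build (but do not send) a PUT request that scales a model's workers."""
    if min_worker > max_worker:
        raise ValueError("min_worker must not exceed max_worker")
    # Assumption: min_worker and max_worker travel in the query string.
    query = urlencode({"min_worker": min_worker, "max_worker": max_worker})
    url = f"{base}/models/{quote(model_name, safe='')}?{query}"
    return Request(url, method="PUT")
```

Passing the returned object to urllib.request.urlopen would issue the call against a running TorchServe instance.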

Code Reference

Source Location

File	Lines	Description
`docs/metrics_api.md`	L1-118	Metrics API documentation and Prometheus integration guide

Signature

GET http://localhost:8082/metrics
    Returns: Prometheus-formatted metrics (text/plain)

GET http://localhost:8082/metrics?name[]=<metric_name>&name[]=<metric_name>
    Returns: Filtered Prometheus-formatted metrics for specified names

PUT http://localhost:8081/models/{model_name}
    Query parameters: min_worker (int), max_worker (int)
    Returns: Model scaling confirmation

Import

# The Metrics API is a built-in TorchServe HTTP endpoint.
# No Python import is required -- it is enabled via configuration:

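# In config.properties:
metrics_mode=prometheus
enable_metrics_api=true
# 0.0.0.0 exposes the endpoint on all network interfaces;
# the default, http://127.0.0.1:8082, restricts access to localhost.
metrics_address=http://0.0.0.0:8082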
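
Because metrics_address decides whether the endpoint is reachable from other hosts, a deployment script may want to check it before starting the server. The following is a rough sketch with hypothetical helper names (read_properties, binds_loopback_only). It handles only simple key=value lines, not the full Java properties syntax.

```python
import ipaddress
from urllib.parse import urlsplit


def read_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def binds_loopback_only(metrics_address):
    """Return True if the address host is localhost or a loopback IP."""
    host = urlsplit(metrics_address).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; treat unknown hostnames as non-loopback.
        return False
```

With the configuration shown above, binds_loopback_only returns False for http://0.0.0.0:8082, which signals that the endpoint is exposed beyond localhost.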

I/O Contract

Direction	Type	Description
Input	HTTP GET	`GET /metrics` with optional `name[]` query parameters
Output	text/plain	Prometheus exposition format with TYPE, HELP, and metric lines
Configuration	config.properties	`metrics_mode=prometheus`, `enable_metrics_api=true`, `metrics_address`
Port	int	Default 8082
Access	Network	localhost only by default; configurable via `metrics_address`

System Metrics

Metric Name	Type	Unit	Description
`CPUUtilization`	gauge	Percent	Host CPU utilization
`MemoryUsed`	gauge	Megabytes	Host RAM in use
`MemoryAvailable`	gauge	Megabytes	Host RAM available
`MemoryUtilization`	gauge	Percent	Host RAM usage percentage
`DiskUsage`	gauge	Gigabytes	Disk space used
`DiskAvailable`	gauge	Gigabytes	Disk space available
`DiskUtilization`	gauge	Percent	Disk usage percentage
`GPUUtilization`	gauge	Percent	GPU compute utilization
`GPUMemoryUtilization`	gauge	Percent	GPU memory utilization
`GPUMemoryUsed`	gauge	Megabytes	GPU memory in use

Model Metrics

Metric Name	Type	Unit	Description
`HandlerTime`	gauge	ms	Total handler execution time (preprocess + inference + postprocess)
`PredictionTime`	gauge	ms	Inference-only execution time
`QueueTime`	gauge	Milliseconds	Time request waited in queue
`WorkerLoadTime`	gauge	Milliseconds	Time to load model into worker
`WorkerThreadTime`	gauge	Milliseconds	Time spent in worker thread

Counter Metrics

Metric Name	Type	Unit	Description
`Requests2XX`	counter	Count	Successful requests
`Requests4XX`	counter	Count	Client error requests
`Requests5XX`	counter	Count	Server error requests
`ts_inference_requests_total`	counter	Count	Total inference requests per model/version
`ts_inference_latency_microseconds`	counter	Microseconds	Cumulative inference latency
`ts_queue_latency_microseconds`	counter	Microseconds	Cumulative queue latency
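
Since the latency counters are cumulative, a per-request average has to be derived by dividing by the request counter. The sketch below shows that arithmetic. It assumes both counters cover the same set of requests for one model and version, and mean_latency_ms is a hypothetical helper name.

```python
def mean_latency_ms(latency_us_total, requests_total):
    """Average latency per request in milliseconds, from cumulative counters.

    Returns None when no requests have been counted yet.
    """
    if requests_total <= 0:
        return None
    # Microseconds per request, then converted to milliseconds.
    return latency_us_total / requests_total / 1000.0
```

With the sample values shown later on this page (290371.129 microseconds over 3 requests), this gives roughly 96.8 ms per request.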

Usage Examples

Example 1: Full Metrics Output

curl http://127.0.0.1:8082/metrics

Returns output in the following format:

# HELP Requests5XX Torchserve prometheus counter metric with unit: Count
# TYPE Requests5XX counter
# HELP DiskUsage Torchserve prometheus gauge metric with unit: Gigabytes
# TYPE DiskUsage gauge
DiskUsage{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 20.054508209228516
# HELP GPUUtilization Torchserve prometheus gauge metric with unit: Percent
# TYPE GPUUtilization gauge
# HELP PredictionTime Torchserve prometheus gauge metric with unit: ms
# TYPE PredictionTime gauge
PredictionTime{ModelName="resnet18",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 83.13
# HELP MemoryAvailable Torchserve prometheus gauge metric with unit: Megabytes
# TYPE MemoryAvailable gauge
MemoryAvailable{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 5829.7421875
# HELP ts_inference_requests_total Torchserve prometheus counter metric with unit: Count
# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 3.0
# HELP HandlerTime Torchserve prometheus gauge metric with unit: ms
# TYPE HandlerTime gauge
HandlerTime{ModelName="resnet18",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 82.93
# HELP ts_inference_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds
# TYPE ts_inference_latency_microseconds counter
ts_inference_latency_microseconds{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 290371.129
# HELP CPUUtilization Torchserve prometheus gauge metric with unit: Percent
# TYPE CPUUtilization gauge
CPUUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 0.0
# HELP MemoryUsed Torchserve prometheus gauge metric with unit: Megabytes
# TYPE MemoryUsed gauge
MemoryUsed{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8245.62109375
# HELP QueueTime Torchserve prometheus gauge metric with unit: Milliseconds
# TYPE QueueTime gauge
QueueTime{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 0.0
# HELP Requests2XX Torchserve prometheus counter metric with unit: Count
# TYPE Requests2XX counter
Requests2XX{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8.0
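
To consume this output without a Prometheus server, the sample lines can be split into a name, a label dictionary, and a value. The parser below is a minimal sketch rather than a complete implementation of the exposition format: it skips comment lines, tolerates the trailing comma inside the braces, and ignores lines that carry a timestamp. The name parse_samples is hypothetical.

```python
import re

# metric_name, optional {labels}, then a single value token
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# label="value" pairs; escaped characters inside the value are allowed
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # unsupported line shape (e.g. with a timestamp)
        name, label_text, value = match.groups()
        labels = dict(LABEL_RE.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples
```

For production use, an established parser such as the one in the official Prometheus Python client is likely a better choice than this hand-written version.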

Example 2: Filtered Metrics Query

curl "http://127.0.0.1:8082/metrics?name[]=ts_inference_latency_microseconds&name[]=ts_queue_latency_microseconds" --globoff

Returns only the specified metrics:

# HELP ts_queue_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds
# TYPE ts_queue_latency_microseconds counter
ts_queue_latency_microseconds{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 365.21
# HELP ts_inference_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds
# TYPE ts_inference_latency_microseconds counter
ts_inference_latency_microseconds{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 290371.129

Example 3: Prometheus Server Configuration

Create a prometheus.yml configuration file to scrape TorchServe metrics:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'torchserve'
    static_configs:
    - targets: ['localhost:8082']  # TorchServe metrics endpoint

Start Prometheus with this configuration:

./prometheus --config.file=prometheus.yml

Navigate to http://localhost:9090/ to execute queries and create graphs.
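
Queries can also be issued programmatically through the Prometheus HTTP API instead of the web UI. The sketch below only builds the URL for an instant query against /api/v1/query. It assumes Prometheus runs at localhost:9090 as configured above, and instant_query_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def instant_query_url(promql, base="http://localhost:9090"):
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    # urlencode percent-encodes parentheses and brackets in the expression.
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"
```

For example, instant_query_url("rate(ts_inference_requests_total[5m])") yields a URL whose JSON response contains the per-second request rate over the last five minutes, once the torchserve job has been scraped.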

Example 4: Grafana Integration

After configuring Prometheus, set up Grafana for dashboard visualization:

# Start Grafana
sudo systemctl daemon-reload && sudo systemctl enable grafana-server && sudo systemctl start grafana-server

Navigate to http://localhost:3000/, add the Prometheus data source pointing to http://localhost:9090, and build dashboards for:

GPU Memory Dashboard -- track GPUMemoryUtilization and GPUMemoryUsed over time
Latency Dashboard -- plot HandlerTime, PredictionTime, and QueueTime as time series
Throughput Dashboard -- rate of ts_inference_requests_total per model
Error Rate Dashboard -- ratio of Requests5XX to total requests
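
The throughput and error-rate panels both reduce to simple arithmetic on counter values. The sketch below illustrates the calculations outside Grafana. It assumes that total requests equal the sum of Requests2XX, Requests4XX, and Requests5XX, and that a counter which decreases between two scrapes has been reset. Both function names are hypothetical.

```python
def error_ratio(requests_2xx, requests_4xx, requests_5xx):
    """Share of requests that ended in a server error (0.0 if no traffic)."""
    total = requests_2xx + requests_4xx + requests_5xx
    return requests_5xx / total if total else 0.0


def per_second_rate(prev_value, curr_value, interval_s):
    """Per-second increase of a counter between two scrapes."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: the counter restarted from zero after a reset.
        delta = curr_value
    return delta / interval_s
```

In a real dashboard, Prometheus functions such as rate() perform the equivalent calculation over a time window.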

Related Pages

Principle:Pytorch_Serve_Metrics_Monitoring -- the theoretical basis for production observability in model serving systems
