Implementation:Pytorch Serve Metrics API

From Leeroopedia
Revision as of 13:46, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Pytorch_Serve_Metrics_API.md)
| Field | Value |
|---|---|
| Page Type | Implementation |
| Implementation Type | External Tool Doc |
| Domains | Monitoring, Infrastructure |
| Knowledge Sources | TorchServe |
| Workflow | LLM_Deployment_vLLM |
| Last Updated | 2026-02-13 00:00 GMT |

Overview

The TorchServe Metrics API is an HTTP endpoint that exposes system, model, and counter metrics in Prometheus exposition format. It listens on port 8082 by default, is accessible from localhost only, and returns metrics when the metrics_mode configuration is set to prometheus. This API enables integration with Prometheus servers and Grafana dashboards for production monitoring of LLM deployments.

Description

The Metrics API provides a single HTTP GET endpoint that returns all collected metrics in Prometheus text format. The metrics include host-level system resource utilization, per-model inference performance measurements, and aggregate request counters. The API also supports filtered queries that return only the metrics named in name[] query parameters.

Key characteristics:

  • Default port: 8082
  • Default access: localhost only (configurable via config.properties)
  • Format: Prometheus exposition format (text/plain)
  • Enabled by default: yes (disable with enable_metrics_api=false)
  • Metrics mode: must be set to prometheus for Prometheus-format output

Usage

Querying Metrics

# Retrieve all metrics
curl http://127.0.0.1:8082/metrics

# Retrieve specific metrics by name
curl "http://127.0.0.1:8082/metrics?name[]=ts_inference_latency_microseconds&name[]=ts_queue_latency_microseconds" --globoff
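The same two queries can be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names build_metrics_url and fetch_metrics are hypothetical, and the base URL assumes the default address described on this page.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed default from this page; change it if metrics_address differs.
METRICS_URL = "http://127.0.0.1:8082/metrics"


def build_metrics_url(names=(), base_url=METRICS_URL):
    """Return the metrics URL, adding one name[] parameter per metric name."""
    if not names:
        return base_url
    # safe="[]" keeps the brackets literal, matching the curl example.
    query = urlencode([("name[]", name) for name in names], safe="[]")
    return f"{base_url}?{query}"


def fetch_metrics(names=(), base_url=METRICS_URL, timeout=5.0):
    """Fetch the Prometheus-format text; requires a running TorchServe."""
    with urlopen(build_metrics_url(names, base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_metrics() with no arguments corresponds to the unfiltered curl command, while fetch_metrics(["ts_queue_latency_microseconds"]) corresponds to the filtered one.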

Scaling Models via Management API

# Scale model workers based on observed metrics
# (replace {model_name} with the name of a registered model)
curl -X PUT "http://localhost:8081/models/{model_name}?min_worker=2&max_worker=4"
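A scripted version of the scaling call might look like the sketch below. It assumes the worker counts are passed as query parameters on the PUT request; build_scale_request and scale_model are hypothetical helper names, not part of TorchServe.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

MANAGEMENT_URL = "http://localhost:8081"  # management address used on this page


def build_scale_request(model_name, min_worker, max_worker, base_url=MANAGEMENT_URL):
    """Build (but do not send) the PUT request that resizes a model's worker pool."""
    if not 0 <= min_worker <= max_worker:
        raise ValueError("expected 0 <= min_worker <= max_worker")
    query = urlencode({"min_worker": min_worker, "max_worker": max_worker})
    path = quote(model_name, safe="")  # escape characters that are unsafe in a URL path
    return Request(f"{base_url}/models/{path}?{query}", method="PUT")


def scale_model(model_name, min_worker, max_worker, timeout=30.0):
    """Send the request; requires a running TorchServe with the model registered."""
    request = build_scale_request(model_name, min_worker, max_worker)
    with urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")
```

Separating request construction from sending keeps the validation testable without a live server.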

Code Reference

Source Location

| File | Lines | Description |
|---|---|---|
| docs/metrics_api.md | L1-118 | Metrics API documentation and Prometheus integration guide |

Signature

GET http://localhost:8082/metrics
    Returns: Prometheus-formatted metrics (text/plain)

GET http://localhost:8082/metrics?name[]=<metric_name>&name[]=<metric_name>
    Returns: Filtered Prometheus-formatted metrics for specified names

PUT http://localhost:8081/models/{model_name}
    Query parameters: min_worker (int), max_worker (int)
    Returns: Model scaling confirmation

Import

# The Metrics API is a built-in TorchServe HTTP endpoint.
# No Python import is required -- it is enabled via configuration:

# In config.properties:
metrics_mode=prometheus
enable_metrics_api=true
# 0.0.0.0 exposes the endpoint on all interfaces; the default is http://127.0.0.1:8082
metrics_address=http://0.0.0.0:8082
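Because these three settings interact, a small pre-flight check can catch a misconfiguration before the server starts. The sketch below is illustrative only: parse_properties and check_metrics_config are hypothetical helpers, the parser handles only simple key=value lines, and the fallback address assumes the localhost default of http://127.0.0.1:8082.

```python
from urllib.parse import urlparse


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_metrics_config(props):
    """Return a list of human-readable warnings for the metrics settings."""
    warnings = []
    if props.get("metrics_mode") != "prometheus":
        warnings.append("metrics_mode is not 'prometheus'")
    if props.get("enable_metrics_api", "true").lower() != "true":
        warnings.append("enable_metrics_api is disabled")
    # Assumed default address when the key is absent.
    host = urlparse(props.get("metrics_address", "http://127.0.0.1:8082")).hostname
    if host not in ("127.0.0.1", "localhost", "::1"):
        warnings.append(f"metrics_address binds to {host}, reachable beyond localhost")
    return warnings
```

Run against the configuration above, the check reports only the 0.0.0.0 binding, which is intentional when Prometheus scrapes from another host.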

I/O Contract

| Direction | Type | Description |
|---|---|---|
| Input | HTTP GET | GET /metrics with optional name[] query parameters |
| Output | text/plain | Prometheus exposition format with TYPE, HELP, and metric lines |
| Configuration | config.properties | metrics_mode=prometheus, enable_metrics_api=true, metrics_address |
| Port | int | Default 8082 |
| Access | Network | localhost only by default; configurable via metrics_address |

System Metrics

| Metric Name | Type | Unit | Description |
|---|---|---|---|
| CPUUtilization | gauge | Percent | Host CPU utilization |
| MemoryUsed | gauge | Megabytes | Host RAM in use |
| MemoryAvailable | gauge | Megabytes | Host RAM available |
| MemoryUtilization | gauge | Percent | Host RAM usage percentage |
| DiskUsage | gauge | Gigabytes | Disk space used |
| DiskAvailable | gauge | Gigabytes | Disk space available |
| DiskUtilization | gauge | Percent | Disk usage percentage |
| GPUUtilization | gauge | Percent | GPU compute utilization |
| GPUMemoryUtilization | gauge | Percent | GPU memory utilization |
| GPUMemoryUsed | gauge | Megabytes | GPU memory in use |

Model Metrics

| Metric Name | Type | Unit | Description |
|---|---|---|---|
| HandlerTime | gauge | ms | Total handler execution time (preprocess + inference + postprocess) |
| PredictionTime | gauge | ms | Backend prediction time, measured around the handler call (so it is at least HandlerTime) |
| QueueTime | gauge | Milliseconds | Time request waited in queue |
| WorkerLoadTime | gauge | Milliseconds | Time to load model into worker |
| WorkerThreadTime | gauge | Milliseconds | Time spent in worker thread |

Counter Metrics

| Metric Name | Type | Unit | Description |
|---|---|---|---|
| Requests2XX | counter | Count | Successful requests |
| Requests4XX | counter | Count | Client error requests |
| Requests5XX | counter | Count | Server error requests |
| ts_inference_requests_total | counter | Count | Total inference requests per model/version |
| ts_inference_latency_microseconds | counter | Microseconds | Cumulative inference latency |
| ts_queue_latency_microseconds | counter | Microseconds | Cumulative queue latency |
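Because the latency counters are cumulative, a per-request mean has to be derived by dividing by the request counter. The sketch below shows that arithmetic; average_latency_ms is a hypothetical helper, and the inputs are the resnet18 values from the sample output on this page.

```python
def average_latency_ms(latency_us_total, requests_total):
    """Mean latency per request in milliseconds from two cumulative counters."""
    if requests_total <= 0:
        return None  # no requests yet, so the mean is undefined
    return latency_us_total / requests_total / 1000.0


# ts_inference_latency_microseconds and ts_inference_requests_total for resnet18.
mean_ms = average_latency_ms(290371.129, 3.0)
print(f"{mean_ms:.2f} ms per request")  # prints "96.79 ms per request"
```

This is a lifetime average since the server started. For a recent-window average, apply the same division to the differences between two scrapes.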

Usage Examples

Example 1: Full Metrics Output

curl http://127.0.0.1:8082/metrics

Returns output in the following format:

# HELP Requests5XX Torchserve prometheus counter metric with unit: Count
# TYPE Requests5XX counter
# HELP DiskUsage Torchserve prometheus gauge metric with unit: Gigabytes
# TYPE DiskUsage gauge
DiskUsage{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 20.054508209228516
# HELP GPUUtilization Torchserve prometheus gauge metric with unit: Percent
# TYPE GPUUtilization gauge
# HELP PredictionTime Torchserve prometheus gauge metric with unit: ms
# TYPE PredictionTime gauge
PredictionTime{ModelName="resnet18",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 83.13
# HELP MemoryAvailable Torchserve prometheus gauge metric with unit: Megabytes
# TYPE MemoryAvailable gauge
MemoryAvailable{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 5829.7421875
# HELP ts_inference_requests_total Torchserve prometheus counter metric with unit: Count
# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 3.0
# HELP HandlerTime Torchserve prometheus gauge metric with unit: ms
# TYPE HandlerTime gauge
HandlerTime{ModelName="resnet18",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 82.93
# HELP ts_inference_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds
# TYPE ts_inference_latency_microseconds counter
ts_inference_latency_microseconds{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 290371.129
# HELP CPUUtilization Torchserve prometheus gauge metric with unit: Percent
# TYPE CPUUtilization gauge
CPUUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 0.0
# HELP MemoryUsed Torchserve prometheus gauge metric with unit: Megabytes
# TYPE MemoryUsed gauge
MemoryUsed{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8245.62109375
# HELP QueueTime Torchserve prometheus gauge metric with unit: Milliseconds
# TYPE QueueTime gauge
QueueTime{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 0.0
# HELP Requests2XX Torchserve prometheus counter metric with unit: Count
# TYPE Requests2XX counter
Requests2XX{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8.0
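For ad hoc scripts that need these values without a Prometheus server, the text can be parsed line by line. The following is a simplified sketch rather than a complete exposition-format parser: it tolerates the trailing comma inside the label braces seen above, but does not handle timestamps or unescape label values. The name parse_metrics is hypothetical.

```python
import re

# name{labels} value; labels are optional, and a trailing comma is tolerated.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Metrics that appear with only HELP and TYPE lines, such as Requests5XX above, produce no tuples because no sample has been recorded for them yet.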

Example 2: Filtered Metrics Query

curl "http://127.0.0.1:8082/metrics?name[]=ts_inference_latency_microseconds&name[]=ts_queue_latency_microseconds" --globoff

Returns only the specified metrics:

# HELP ts_queue_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds
# TYPE ts_queue_latency_microseconds counter
ts_queue_latency_microseconds{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 365.21
# HELP ts_inference_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds
# TYPE ts_inference_latency_microseconds counter
ts_inference_latency_microseconds{model_name="resnet18",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 290371.129

Example 3: Prometheus Server Configuration

Create a prometheus.yml configuration file to scrape TorchServe metrics:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'torchserve'
    static_configs:
    - targets: ['localhost:8082']  # TorchServe metrics endpoint

Start Prometheus with this configuration:

./prometheus --config.file=prometheus.yml

Navigate to http://localhost:9090/ to execute queries and create graphs.
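Queries can also be run programmatically. The sketch below assumes Prometheus is reachable at the address above and uses its instant-query HTTP endpoint (/api/v1/query); build_query_url and run_query are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # address used in this example


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def run_query(promql, timeout=10.0):
    """Run the query; requires Prometheus to be running and scraping TorchServe."""
    with urlopen(build_query_url(promql), timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["data"]["result"]
```

For example, run_query("rate(ts_inference_requests_total[5m])") would return the per-second request rate for each model series, once at least two scrapes have been collected.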

Example 4: Grafana Integration

After configuring Prometheus, set up Grafana for dashboard visualization:

# Start Grafana
sudo systemctl daemon-reload && sudo systemctl enable grafana-server && sudo systemctl start grafana-server

Navigate to http://localhost:3000/, add the Prometheus data source pointing to http://localhost:9090, and build dashboards for:

  • GPU Memory Dashboard -- track GPUMemoryUtilization and GPUMemoryUsed over time
  • Latency Dashboard -- plot HandlerTime, PredictionTime, and QueueTime as time series
  • Throughput Dashboard -- rate of ts_inference_requests_total per model
  • Error Rate Dashboard -- ratio of Requests5XX to total requests
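The throughput and error-rate panels reduce to simple arithmetic on counter readings, sketched below. Both helper names are hypothetical, and treating "total requests" as the sum of Requests2XX, Requests4XX, and Requests5XX is an assumption rather than something the source defines.

```python
def error_ratio(requests_2xx, requests_4xx, requests_5xx):
    """Share of requests that ended in a 5XX status, or None with no traffic."""
    total = requests_2xx + requests_4xx + requests_5xx
    if total == 0:
        return None
    return requests_5xx / total


def throughput_per_second(count_earlier, count_later, seconds_between):
    """Average request rate between two readings of a cumulative counter."""
    if seconds_between <= 0:
        raise ValueError("seconds_between must be positive")
    if count_later < count_earlier:
        # A counter that went backwards suggests a restart; count from zero.
        count_earlier = 0.0
    return (count_later - count_earlier) / seconds_between
```

In a dashboard the same calculations would normally be written as queries over time ranges, so that they use recent deltas rather than lifetime totals.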
