Principle:Pytorch Serve Metrics Monitoring

Field	Value
Page Type	Principle
Domains	Monitoring, Infrastructure
Knowledge Sources	TorchServe
Workflow	LLM_Deployment_vLLM
Last Updated	2026-02-13 00:00 GMT

Overview

Production observability for model serving requires collecting system metrics (CPU, GPU, memory) and model metrics (latency, throughput), then exposing them in a standardized format for monitoring infrastructure. TorchServe provides a dedicated metrics API that emits Prometheus-formatted metrics, enabling integration with Prometheus servers, Grafana dashboards, and alerting systems. For LLM deployments with vLLM, these metrics are essential for capacity planning, performance tuning, and operational reliability.

Description

The Three Pillars of Model Serving Observability

Model serving observability in TorchServe is organized into three categories of metrics:

1. System Metrics -- hardware resource utilization at the host level:

CPUUtilization (Percent) -- processor usage across all cores
MemoryUsed / MemoryAvailable (Megabytes) -- RAM consumption and headroom
MemoryUtilization (Percent) -- percentage of total RAM in use
DiskUsage / DiskAvailable (Gigabytes) -- storage consumption
DiskUtilization (Percent) -- percentage of disk capacity used
GPUUtilization (Percent) -- GPU compute utilization (requires pynvml)
GPUMemoryUtilization (Percent) -- GPU memory utilization
GPUMemoryUsed (Megabytes) -- absolute GPU memory consumption

These metrics are critical for LLM serving because large models consume significant GPU memory and compute resources. Monitoring GPU memory utilization helps prevent out-of-memory (OOM) errors, while GPU utilization indicates whether the model is effectively using available compute.

2. Model Metrics -- per-model performance measurements:

HandlerTime (ms) -- total time spent in the Python handler (preprocess + inference + postprocess). For the VLLMHandler, this includes the full async pipeline.
PredictionTime (ms) -- time for the inference step alone, including vLLM engine processing
QueueTime (ms) -- time a request spends waiting in the TorchServe queue before being dispatched to a worker

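These metrics reveal the latency breakdown for each request. High QueueTime indicates that the model is saturated and the server needs more workers or replicas. Because HandlerTime covers preprocessing, inference, and postprocessing, a HandlerTime that is high relative to PredictionTime points to overhead in the preprocessing or postprocessing steps.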
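The latency breakdown described above can be sketched in a few lines. This is a minimal illustration, not TorchServe code: the LatencySample class and breakdown function are hypothetical names, and treating total latency as QueueTime plus HandlerTime is a simplifying assumption that ignores network and frontend overhead.

```python
from dataclasses import dataclass


@dataclass
class LatencySample:
    """One request's timings in milliseconds (fields mirror the metric names)."""
    queue_time_ms: float
    handler_time_ms: float
    prediction_time_ms: float


def breakdown(sample: LatencySample) -> dict:
    """Split a request's latency into queueing, inference, and handler overhead."""
    # Overhead is the handler time spent outside the inference step,
    # i.e. preprocessing plus postprocessing.
    overhead_ms = sample.handler_time_ms - sample.prediction_time_ms
    # Assumption: total latency is approximated as queue time + handler time.
    total_ms = sample.queue_time_ms + sample.handler_time_ms
    if total_ms <= 0:
        return {"queue_share": 0.0, "inference_share": 0.0, "overhead_share": 0.0}
    return {
        "queue_share": sample.queue_time_ms / total_ms,
        "inference_share": sample.prediction_time_ms / total_ms,
        "overhead_share": overhead_ms / total_ms,
    }


# Hypothetical sample values, chosen only for illustration.
result = breakdown(
    LatencySample(queue_time_ms=50.0, handler_time_ms=200.0, prediction_time_ms=150.0)
)
```

A large queue_share suggests adding capacity, while a large overhead_share suggests looking at the handler's preprocessing and postprocessing code.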

3. Counter Metrics -- aggregate request statistics:

Requests2XX / Requests4XX / Requests5XX (Count) -- HTTP response code counters
ts_inference_requests_total (Count) -- total inference requests per model and version
ts_inference_latency_microseconds (Microseconds) -- cumulative inference latency counter
ts_queue_latency_microseconds (Microseconds) -- cumulative queue wait time counter
WorkerLoadTime (Milliseconds) -- time taken to load a model into a worker (reported as a gauge rather than a cumulative counter)
WorkerThreadTime (Milliseconds) -- time spent in worker threads (also reported as a gauge)
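Because the ts_inference and ts_queue latency metrics are cumulative, a per-request average has to be derived from the difference between two scrapes. The following is a minimal sketch of that arithmetic; the function name, the dictionary layout, and the snapshot values are hypothetical and only illustrate the calculation.

```python
def mean_latency_ms(prev: dict, curr: dict) -> dict:
    """Average per-request latency between two scrapes of the cumulative counters."""
    requests = curr["ts_inference_requests_total"] - prev["ts_inference_requests_total"]
    if requests <= 0:
        # No new requests, or the counters were reset by a restart.
        return {"inference_ms": None, "queue_ms": None}
    inference_us = (
        curr["ts_inference_latency_microseconds"]
        - prev["ts_inference_latency_microseconds"]
    )
    queue_us = (
        curr["ts_queue_latency_microseconds"] - prev["ts_queue_latency_microseconds"]
    )
    # Counters are in microseconds; divide by 1000 to report milliseconds.
    return {
        "inference_ms": inference_us / requests / 1000.0,
        "queue_ms": queue_us / requests / 1000.0,
    }


# Hypothetical counter snapshots taken one scrape interval apart.
previous = {
    "ts_inference_requests_total": 100,
    "ts_inference_latency_microseconds": 5_000_000,
    "ts_queue_latency_microseconds": 200_000,
}
current = {
    "ts_inference_requests_total": 150,
    "ts_inference_latency_microseconds": 8_000_000,
    "ts_queue_latency_microseconds": 300_000,
}
averages = mean_latency_ms(previous, current)
```

In a Prometheus setup the same calculation is normally expressed as a query over the scraped counters rather than in application code.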

Metrics Exposure Model

TorchServe uses a pull-based metrics model where a dedicated HTTP endpoint serves metrics in Prometheus text format. This aligns with the Prometheus ecosystem's scraping architecture:

The metrics endpoint listens on port 8082 by default
It is accessible only from localhost by default (configurable via config.properties)
Metrics are returned in Prometheus exposition format when metrics_mode is set to prometheus
The endpoint is enabled by default and can be disabled via enable_metrics_api=false
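To make the exposition format concrete, the sketch below parses a small, hand-written sample of Prometheus text output using only the standard library. The sample lines, model name, and hostname are hypothetical and not captured from a real server, and the parser is deliberately simplified (it ignores timestamps and does not unescape label values). A production setup would rely on Prometheus itself or an established client library.

```python
import re

# Matches sample lines such as: Name{label="value",other="x"} 12.5
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text: str) -> list:
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a sample line
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical response body, as might be returned by GET :8082/metrics.
SAMPLE_TEXT = """\
# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="llama",model_version="default"} 42.0
# TYPE GPUMemoryUtilization gauge
GPUMemoryUtilization{Level="Host",Hostname="host-1"} 87.5
"""

samples = parse_prometheus_text(SAMPLE_TEXT)
```

Each tuple carries the metric name, its labels as a dictionary, and the numeric value, which is enough to filter by label or feed simple checks.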

Metric Labels

Each metric includes contextual labels:

Level -- "Host" for system metrics, "Model" for model-specific metrics
Hostname -- the server's hostname for multi-host identification
ModelName -- the model name (for model-level metrics)
WorkerName -- the worker identifier (for worker-level metrics like WorkerLoadTime)
model_name / model_version -- for ts_inference counter metrics

Scaling Based on Metrics

TorchServe's management API lets operators change the number of workers at runtime, so scaling decisions can be driven by observed metrics:

PUT /models/{model_name} with min_worker and max_worker parameters adjusts the worker pool
Monitoring QueueTime and GPUUtilization enables data-driven scaling decisions
For vLLM workloads, scaling is typically done at the model instance level (adding replicas) rather than adding workers, since vLLM handles internal concurrency
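A metrics-driven scaling step might look like the sketch below. It is an assumption-laden illustration: recommend_workers and build_scale_request are hypothetical helpers, the 100 ms queue threshold is arbitrary, and the base URL assumes the management API is reachable at localhost:8081 (a port not stated above). The request is only constructed, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def recommend_workers(current: int, queue_time_ms: float, max_workers: int,
                      queue_threshold_ms: float = 100.0) -> int:
    """Suggest one more worker while requests queue longer than the threshold."""
    if queue_time_ms > queue_threshold_ms and current < max_workers:
        return current + 1
    return current


def build_scale_request(model_name: str, min_worker: int, max_worker: int,
                        base_url: str = "http://localhost:8081") -> Request:
    """Build (but do not send) a PUT /models/{model_name} scaling request."""
    query = urlencode({"min_worker": min_worker, "max_worker": max_worker})
    url = f"{base_url}/models/{quote(model_name, safe='')}?{query}"
    return Request(url, method="PUT")


# Hypothetical observation: 2 workers, requests waiting 250 ms in the queue.
target = recommend_workers(current=2, queue_time_ms=250.0, max_workers=4)
request = build_scale_request("llama", min_worker=target, max_worker=target)
```

Sending the request would be a matter of passing it to urllib.request.urlopen. For vLLM-backed models the same decision logic would more likely drive replica counts in the surrounding orchestration layer.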

Usage

Metrics monitoring is a continuous operational concern throughout the lifetime of a deployed model. The typical integration workflow:

Enable Prometheus output in the TorchServe configuration (the metrics API is enabled by default; set metrics_mode to prometheus)
Configure Prometheus to scrape the :8082/metrics endpoint at regular intervals (e.g., 15 seconds)
Build Grafana dashboards to visualize latency distributions, throughput, and resource utilization
Set up alerts for critical thresholds (GPU memory > 90%, 5XX error rate > 1%, p99 latency > SLA)
Use metrics for capacity planning -- correlate request volume with GPU utilization to right-size the deployment
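The alerting step in the workflow above can be sketched as a simple threshold check. In practice these rules would live in the alerting system rather than in Python; the function name and alert identifiers here are hypothetical, and the inputs (including the p99 latency) are assumed to have been computed already by the monitoring stack.

```python
def evaluate_alerts(gpu_memory_pct: float, requests_5xx: int, requests_total: int,
                    p99_latency_ms: float, sla_ms: float) -> list:
    """Return the names of the alert conditions that currently fire."""
    alerts = []
    # GPU memory above 90% leaves little headroom before OOM errors.
    if gpu_memory_pct > 90.0:
        alerts.append("gpu_memory_high")
    # 5XX error rate above 1% of all requests in the window.
    if requests_total > 0 and requests_5xx / requests_total > 0.01:
        alerts.append("error_rate_high")
    # Tail latency exceeding the agreed SLA.
    if p99_latency_ms > sla_ms:
        alerts.append("p99_latency_over_sla")
    return alerts


# Hypothetical window: 93% GPU memory, 5 of 200 requests failed, p99 within SLA.
firing = evaluate_alerts(
    gpu_memory_pct=93.0,
    requests_5xx=5,
    requests_total=200,
    p99_latency_ms=800.0,
    sla_ms=1000.0,
)
```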

For LLM workloads specifically, key metrics to monitor include:

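GPUMemoryUtilization -- vLLM's KV cache fills as concurrent sequences grow, and exhausting it triggers preemption; because vLLM typically reserves its GPU memory budget at startup, this metric may sit near the configured limit even under light load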
HandlerTime -- end-to-end latency including token generation; long-tail distribution is expected for autoregressive models
ts_inference_requests_total -- throughput tracking for SLA monitoring

Theoretical Basis

Prometheus Pull Model

The pull-based metrics model (where a central server scrapes endpoints) has several advantages over push-based approaches for model serving:

Decoupled collection -- the model server does not need to know about the monitoring infrastructure
No backpressure -- if the monitoring system is slow, it simply skips a scrape interval without affecting the serving path
Service discovery -- Prometheus can automatically discover and scrape new model server instances

RED Method

TorchServe's metrics align with the RED method (Rate, Errors, Duration) for monitoring request-driven services:

Rate -- ts_inference_requests_total provides the request rate
Errors -- Requests4XX and Requests5XX track error rates
Duration -- ts_inference_latency_microseconds, HandlerTime, and PredictionTime measure latency
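As a rough illustration of the Rate and Errors signals, the sketch below derives a request rate and an error ratio from two snapshots of the counters. The function name, the snapshot values, and the 15-second interval are hypothetical. Counting 4XX responses as errors is a design choice that some teams relax, since client errors do not always indicate a service problem.

```python
def red_summary(prev: dict, curr: dict, interval_s: float) -> dict:
    """Derive request rate and error ratio from two counter snapshots."""
    def delta(key: str) -> float:
        return curr.get(key, 0) - prev.get(key, 0)

    ok = delta("Requests2XX")
    client_err = delta("Requests4XX")
    server_err = delta("Requests5XX")
    responses = ok + client_err + server_err
    return {
        # Rate: new inference requests per second over the interval.
        "rate_per_s": delta("ts_inference_requests_total") / interval_s,
        # Errors: share of responses in the interval that were 4XX or 5XX.
        "error_ratio": (client_err + server_err) / responses if responses else 0.0,
    }


# Hypothetical snapshots taken 15 seconds apart.
before = {"ts_inference_requests_total": 1000, "Requests2XX": 980,
          "Requests4XX": 15, "Requests5XX": 5}
after = {"ts_inference_requests_total": 1300, "Requests2XX": 1270,
         "Requests4XX": 21, "Requests5XX": 9}
summary = red_summary(before, after, interval_s=15.0)
```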

USE Method for System Resources

For system-level metrics, TorchServe follows the USE method (Utilization, Saturation, Errors):

Utilization -- CPUUtilization, GPUUtilization, MemoryUtilization, DiskUtilization
Saturation -- QueueTime indicates when the system is saturated (requests queuing)
Errors -- Requests5XX captures system-level failures

The combination of RED and USE methods provides comprehensive observability across both the application layer (model inference) and the infrastructure layer (compute, memory, storage).

Related Pages

Implementation:Pytorch_Serve_Metrics_API -- the concrete HTTP API for retrieving Prometheus-formatted metrics from TorchServe
