Principle:Pytorch Serve Metrics Monitoring
| Field | Value |
|---|---|
| Page Type | Principle |
| Domains | Monitoring, Infrastructure |
| Knowledge Sources | TorchServe |
| Workflow | LLM_Deployment_vLLM |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Production observability for model serving requires collecting system metrics (CPU, GPU, memory), model metrics (latency, throughput), and exposing them in a standardized format for monitoring infrastructure. TorchServe provides a dedicated metrics API that emits Prometheus-formatted metrics, enabling integration with Prometheus servers, Grafana dashboards, and alerting systems. For LLM deployments with vLLM, these metrics are essential for capacity planning, performance tuning, and operational reliability.
Description
The Three Pillars of Model Serving Observability
Model serving observability in TorchServe is organized into three categories of metrics:
1. System Metrics -- hardware resource utilization at the host level:
- CPUUtilization (Percent) -- processor usage across all cores
- MemoryUsed / MemoryAvailable (Megabytes) -- RAM consumption and headroom
- MemoryUtilization (Percent) -- percentage of total RAM in use
- DiskUsage / DiskAvailable (Gigabytes) -- storage consumption
- DiskUtilization (Percent) -- percentage of disk capacity used
- GPUUtilization (Percent) -- GPU compute utilization (requires pynvml)
- GPUMemoryUtilization (Percent) -- GPU memory utilization
- GPUMemoryUsed (Megabytes) -- absolute GPU memory consumption
These metrics are critical for LLM serving because large models consume significant GPU memory and compute resources. Monitoring GPU memory utilization helps prevent out-of-memory (OOM) errors, while GPU utilization indicates whether the model is effectively using available compute.
2. Model Metrics -- per-model performance measurements:
- HandlerTime (ms) -- total time spent in the Python handler (preprocess + inference + postprocess). For the VLLMHandler, this includes the full async pipeline.
- PredictionTime (ms) -- time for the inference step alone, including vLLM engine processing
- QueueTime (ms) -- time a request spends waiting in the TorchServe queue before being dispatched to a worker
These metrics reveal the latency breakdown for each request. High QueueTime indicates that requests arrive faster than workers can drain them: the model is saturated or the server needs more workers. A large gap between HandlerTime and PredictionTime points to preprocessing or postprocessing overhead.
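The breakdown described above can be made concrete with a small calculation. This is a minimal sketch, not TorchServe code: the function name `latency_breakdown` and the sample values are hypothetical, and it assumes QueueTime and HandlerTime do not overlap, so their sum approximates the server-side latency of one request.

```python
def latency_breakdown(queue_ms, handler_ms, prediction_ms):
    """Split one request's latency into queue wait and handler overhead.

    Assumption: queue time and handler time are additive and non-overlapping.
    """
    # Time spent in the handler outside the inference step (pre/postprocessing).
    overhead_ms = max(handler_ms - prediction_ms, 0.0)
    total_ms = queue_ms + handler_ms
    return {
        "total_ms": total_ms,
        "queue_share": queue_ms / total_ms if total_ms else 0.0,
        "overhead_ms": overhead_ms,
        "overhead_share": overhead_ms / handler_ms if handler_ms else 0.0,
    }


# Hypothetical values for a single request.
breakdown = latency_breakdown(queue_ms=40.0, handler_ms=200.0, prediction_ms=180.0)
```

A high `queue_share` suggests a capacity problem, while a high `overhead_share` suggests the handler code around inference is worth profiling.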
3. Counter Metrics -- aggregate request statistics:
- Requests2XX / Requests4XX / Requests5XX (Count) -- HTTP response code counters
- ts_inference_requests_total (Count) -- total inference requests per model and version
- ts_inference_latency_microseconds (Microseconds) -- cumulative inference latency counter
- ts_queue_latency_microseconds (Microseconds) -- cumulative queue wait time counter
- WorkerLoadTime (Milliseconds) -- time taken to load a model into a worker
- WorkerThreadTime (Milliseconds) -- time spent in worker threads
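Because the `ts_*` latency metrics are cumulative, an average latency has to be derived from the difference between two scrapes. The sketch below illustrates that arithmetic, assuming each scrape has already been reduced to a plain dictionary of metric name to value for a single model; the function name and the sample numbers are hypothetical.

```python
def mean_latency_ms(prev, curr, latency_key):
    """Average latency per request between two scrapes, in milliseconds.

    Returns None when there was no traffic or a counter went backwards
    (for example after a server restart).
    """
    requests = curr["ts_inference_requests_total"] - prev["ts_inference_requests_total"]
    latency_us = curr[latency_key] - prev[latency_key]
    if requests <= 0 or latency_us < 0:
        return None
    return latency_us / requests / 1000.0  # microseconds -> milliseconds


# Hypothetical counter values from two consecutive scrapes.
prev = {"ts_inference_requests_total": 100, "ts_inference_latency_microseconds": 5_000_000}
curr = {"ts_inference_requests_total": 150, "ts_inference_latency_microseconds": 8_000_000}

avg_ms = mean_latency_ms(prev, curr, "ts_inference_latency_microseconds")
```

The same function applies to `ts_queue_latency_microseconds` by passing that name as `latency_key`, provided both dictionaries contain it.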
Metrics Exposure Model
TorchServe uses a pull-based metrics model where a dedicated HTTP endpoint serves metrics in Prometheus text format. This aligns with the Prometheus ecosystem's scraping architecture:
- The metrics endpoint listens on port 8082 by default
- It is accessible only from localhost by default (configurable via `config.properties`)
- Metrics are returned in Prometheus exposition format when `metrics_mode` is set to `prometheus`
- The endpoint is enabled by default and can be disabled via `enable_metrics_api=false`
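To show what a consumer of the endpoint receives, here is a minimal sketch of a parser for the common case of the Prometheus text format (`name{label="value"} number`). It uses only the standard library and is not a full implementation: escaped characters in label values are left as-is. The sample text, including the model name and values, is made up for illustration.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Illustrative payload; real output depends on the deployed models.
example_text = """\
# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="demo",model_version="default"} 12.0
MemoryUsed{Level="Host",Hostname="host-1"} 2048.5
"""
parsed = parse_exposition(example_text)
```

In a real deployment, Prometheus itself does this parsing during a scrape; a hand-written parser like this is mainly useful for ad hoc checks and tests.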
Metric Labels
Each metric includes contextual labels:
- Level -- "Host" for system metrics, "Model" for model-specific metrics
- Hostname -- the server's hostname for multi-host identification
- ModelName -- the model name (for model-level metrics)
- WorkerName -- the worker identifier (for worker-level metrics like WorkerLoadTime)
- model_name / model_version -- for ts_inference counter metrics
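Labels are what make it possible to separate one model's samples from another's. The following sketch filters already-parsed samples by label, assuming each sample is a dictionary with `name`, `labels`, and `value` keys; that structure, the helper `select_samples`, and the sample values are hypothetical.

```python
def select_samples(samples, name, **labels):
    """Return samples with the given metric name whose labels match all filters."""
    return [
        s for s in samples
        if s["name"] == name
        and all(s["labels"].get(key) == value for key, value in labels.items())
    ]


# Hypothetical samples using the label names listed above.
samples = [
    {"name": "MemoryUsed",
     "labels": {"Level": "Host", "Hostname": "host-1"}, "value": 2048.0},
    {"name": "HandlerTime",
     "labels": {"Level": "Model", "Hostname": "host-1", "ModelName": "my_llm"}, "value": 210.0},
    {"name": "HandlerTime",
     "labels": {"Level": "Model", "Hostname": "host-1", "ModelName": "other"}, "value": 35.0},
]

llm_handler = select_samples(samples, "HandlerTime", ModelName="my_llm")
```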
Scaling Based on Metrics
TorchServe's management API lets operators adjust worker counts at runtime in response to observed metrics:
- `PUT /models/{model_name}` with `min_worker` and `max_worker` parameters adjusts the worker pool
- Monitoring QueueTime and GPUUtilization enables data-driven scaling decisions
- For vLLM workloads, scaling is typically done at the model instance level (adding replicas) rather than adding workers, since vLLM handles internal concurrency
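As an illustration of the scaling call, the sketch below builds the request URL for `PUT /models/{model_name}` without sending it. The helper name, the model name `my_llm`, and the base URL are assumptions; `http://localhost:8081` is used here on the assumption that the management API runs on its usual default port.

```python
from urllib.parse import quote, urlencode


def scale_workers_url(base_url, model_name, min_worker, max_worker):
    """Build the management API URL that adjusts a model's worker pool."""
    if min_worker < 0 or max_worker < min_worker:
        raise ValueError("expected 0 <= min_worker <= max_worker")
    query = urlencode({"min_worker": min_worker, "max_worker": max_worker})
    return f"{base_url.rstrip('/')}/models/{quote(model_name, safe='')}?{query}"


# Assumed management address and a hypothetical model name.
url = scale_workers_url("http://localhost:8081", "my_llm", min_worker=2, max_worker=4)
```

The resulting URL could then be sent with any HTTP client using the PUT method, for example `urllib.request.Request(url, method="PUT")`.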
Usage
Metrics monitoring is a continuous operational concern throughout the lifetime of a deployed model. The typical integration workflow:
- Enable metrics in TorchServe configuration (enabled by default)
- Configure Prometheus to scrape the `:8082/metrics` endpoint at regular intervals (e.g., 15 seconds)
- Build Grafana dashboards to visualize latency distributions, throughput, and resource utilization
- Set up alerts for critical thresholds (GPU memory > 90%, 5XX error rate > 1%, p99 latency > SLA)
- Use metrics for capacity planning -- correlate request volume with GPU utilization to right-size the deployment
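The alert thresholds in the workflow above can be expressed as simple checks. In practice these would normally live in the monitoring system's alerting rules; the Python below is only a sketch to make the conditions concrete, and the function name, alert names, and input values are hypothetical.

```python
def evaluate_alerts(gpu_mem_pct, requests_5xx, requests_total, p99_ms, sla_ms):
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    if gpu_mem_pct > 90.0:  # GPU memory > 90%
        alerts.append("gpu_memory_high")
    if requests_total > 0 and requests_5xx / requests_total > 0.01:  # 5XX rate > 1%
        alerts.append("error_rate_high")
    if p99_ms > sla_ms:  # p99 latency above the SLA target
        alerts.append("p99_latency_over_sla")
    return alerts


# Hypothetical readings over one evaluation window.
firing = evaluate_alerts(
    gpu_mem_pct=93.5, requests_5xx=3, requests_total=1000, p99_ms=1800.0, sla_ms=2000.0
)
```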
For LLM workloads specifically, key metrics to monitor include:
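- GPUMemoryUtilization -- vLLM's KV cache fills as the number of concurrent sequences grows, and exhausting it triggers preemption; because vLLM typically preallocates its KV cache up front, device-level memory can read high even under light load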
- HandlerTime -- end-to-end latency including token generation; long-tail distribution is expected for autoregressive models
- ts_inference_requests_total -- throughput tracking for SLA monitoring
Theoretical Basis
Prometheus Pull Model
The pull-based metrics model (where a central server scrapes endpoints) has several advantages over push-based approaches for model serving:
- Decoupled collection -- the model server does not need to know about the monitoring infrastructure
- No backpressure -- if the monitoring system is slow, it simply skips a scrape interval without affecting the serving path
- Service discovery -- Prometheus can automatically discover and scrape new model server instances
RED Method
TorchServe's metrics align with the RED method (Rate, Errors, Duration) for monitoring request-driven services:
- Rate -- `ts_inference_requests_total` provides the request rate
- Errors -- `Requests4XX` and `Requests5XX` track error rates
- Duration -- `ts_inference_latency_microseconds`, `HandlerTime`, and `PredictionTime` measure latency
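The Rate and Errors signals can be derived from two scrapes as sketched below. This assumes all four metrics behave as monotonically increasing counters between the scrapes and that each scrape has been reduced to a plain dictionary; the function name and the numbers are hypothetical.

```python
def rate_and_errors(prev, curr, interval_s):
    """Request rate (per second) and error ratio between two scrapes."""
    requests = curr["ts_inference_requests_total"] - prev["ts_inference_requests_total"]
    status_keys = ("Requests2XX", "Requests4XX", "Requests5XX")
    delta = {key: curr[key] - prev[key] for key in status_keys}
    responses = sum(delta.values())
    errors = delta["Requests4XX"] + delta["Requests5XX"]
    return {
        "rate_per_s": requests / interval_s,
        "error_ratio": errors / responses if responses else 0.0,
    }


# Hypothetical counter values taken 15 seconds apart.
prev = {"ts_inference_requests_total": 1000,
        "Requests2XX": 990, "Requests4XX": 8, "Requests5XX": 2}
curr = {"ts_inference_requests_total": 1300,
        "Requests2XX": 1284, "Requests4XX": 12, "Requests5XX": 4}

red = rate_and_errors(prev, curr, interval_s=15.0)
```

Duration comes from the latency metrics listed above rather than from these counters.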
USE Method for System Resources
For system-level metrics, TorchServe follows the USE method (Utilization, Saturation, Errors):
- Utilization -- `CPUUtilization`, `GPUUtilization`, `MemoryUtilization`, `DiskUtilization`
- Saturation -- `QueueTime` indicates when the system is saturated (requests queuing)
- Errors -- `Requests5XX` captures system-level failures
The combination of RED and USE methods provides comprehensive observability across both the application layer (model inference) and the infrastructure layer (compute, memory, storage).
Related Pages
- Implementation:Pytorch_Serve_Metrics_API -- the concrete HTTP API for retrieving Prometheus-formatted metrics from TorchServe