Principle:Pytorch Serve Metrics Monitoring

From Leeroopedia
Page Type: Principle
Domains: Monitoring, Infrastructure
Knowledge Sources: TorchServe
Workflow: LLM_Deployment_vLLM
Last Updated: 2026-02-13 00:00 GMT

Overview

Production observability for model serving requires collecting system metrics (CPU, GPU, memory) and model metrics (latency, throughput), then exposing them in a standardized format for monitoring infrastructure. TorchServe provides a dedicated metrics API that, when metrics_mode is set to prometheus, emits Prometheus-formatted metrics, enabling integration with Prometheus servers, Grafana dashboards, and alerting systems. For LLM deployments with vLLM, these metrics are essential for capacity planning, performance tuning, and operational reliability.

Description

The Three Pillars of Model Serving Observability

Model serving observability in TorchServe is organized into three categories of metrics:

1. System Metrics -- hardware resource utilization at the host level:

  • CPUUtilization (Percent) -- processor usage across all cores
  • MemoryUsed / MemoryAvailable (Megabytes) -- RAM consumption and headroom
  • MemoryUtilization (Percent) -- percentage of total RAM in use
  • DiskUsage / DiskAvailable (Gigabytes) -- storage consumption
  • DiskUtilization (Percent) -- percentage of disk capacity used
  • GPUUtilization (Percent) -- GPU compute utilization (requires pynvml)
  • GPUMemoryUtilization (Percent) -- GPU memory utilization
  • GPUMemoryUsed (Megabytes) -- absolute GPU memory consumption

These metrics are critical for LLM serving because large models consume significant GPU memory and compute resources. Monitoring GPU memory utilization helps prevent out-of-memory (OOM) errors, while GPU utilization indicates whether the model is effectively using available compute.

2. Model Metrics -- per-model performance measurements:

  • HandlerTime (ms) -- total time spent in the Python handler (preprocess + inference + postprocess). For the VLLMHandler, this includes the full async pipeline.
  • PredictionTime (ms) -- time for the inference step alone, including vLLM engine processing
  • QueueTime (ms) -- time a request spends waiting in the TorchServe queue before being dispatched to a worker

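These metrics reveal the latency breakdown for each request. High QueueTime indicates that requests are arriving faster than workers can take them, so the server needs more workers or the model is saturated. A HandlerTime much larger than PredictionTime points to overhead outside the inference step, that is, in preprocessing or postprocessing.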
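The breakdown described above can be sketched in a few lines of Python. This is an illustrative helper, not part of TorchServe: the function name and the returned keys are hypothetical, and it assumes that QueueTime plus HandlerTime approximates the end-to-end latency of one request (frontend overhead is ignored).

```python
def latency_breakdown(queue_ms: float, handler_ms: float, prediction_ms: float) -> dict:
    """Split one request's latency into queue wait and handler overhead.

    Hypothetical helper: inputs are the QueueTime, HandlerTime, and
    PredictionTime values (all in milliseconds) observed for one request.
    """
    # Handler time spent outside the inference step (preprocess + postprocess).
    overhead_ms = max(handler_ms - prediction_ms, 0.0)
    # Assumption: queue wait + handler time approximates total latency.
    total_ms = queue_ms + handler_ms
    return {
        "overhead_ms": overhead_ms,
        "queue_share": queue_ms / total_ms if total_ms > 0 else 0.0,
        "overhead_share": overhead_ms / handler_ms if handler_ms > 0 else 0.0,
    }
```

A large queue_share suggests adding capacity, while a large overhead_share suggests looking at the handler's preprocessing and postprocessing code first.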

3. Counter Metrics -- aggregate request statistics:

  • Requests2XX / Requests4XX / Requests5XX (Count) -- HTTP response code counters
  • ts_inference_requests_total (Count) -- total inference requests per model and version
  • ts_inference_latency_microseconds (Microseconds) -- cumulative inference latency counter
  • ts_queue_latency_microseconds (Microseconds) -- cumulative queue wait time counter
  • WorkerLoadTime (Milliseconds) -- time taken to load a model into a worker
  • WorkerThreadTime (Milliseconds) -- time spent in worker threads

Metrics Exposure Model

TorchServe uses a pull-based metrics model where a dedicated HTTP endpoint serves metrics in Prometheus text format. This aligns with the Prometheus ecosystem's scraping architecture:

  • The metrics endpoint listens on port 8082 by default
  • It is accessible only from localhost by default (configurable via config.properties)
  • Metrics are returned in Prometheus exposition format when metrics_mode is set to prometheus
  • The endpoint is enabled by default and can be disabled via enable_metrics_api=false
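As a minimal sketch of the pull model, the endpoint can be read with the Python standard library alone. The function names are hypothetical, and the defaults assume a local TorchServe instance with the metrics API on port 8082 and metrics_mode set to prometheus, as described above.

```python
import urllib.request


def build_metrics_url(host: str = "127.0.0.1", port: int = 8082) -> str:
    """Return the URL of the metrics endpoint (defaults match the text above)."""
    return f"http://{host}:{port}/metrics"


def fetch_metrics_text(url: str, timeout_s: float = 5.0) -> str:
    """Perform one scrape: GET the endpoint and return the raw text body."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.read().decode("utf-8")
```

Calling fetch_metrics_text(build_metrics_url()) against a running server returns the exposition text that a Prometheus server would scrape from the same address.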

Metric Labels

Each metric includes contextual labels:

  • Level -- "Host" for system metrics, "Model" for model-specific metrics
  • Hostname -- the server's hostname for multi-host identification
  • ModelName -- the model name (for model-level metrics)
  • WorkerName -- the worker identifier (for worker-level metrics like WorkerLoadTime)
  • model_name / model_version -- for ts_inference counter metrics
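To show how these labels appear to a consumer, the sketch below parses a single line of Prometheus text format into a name, a label dictionary, and a value. It is a simplified parser written for illustration: it ignores optional timestamps and does not unescape label values, and the sample line in the usage note is made up rather than captured from a real server.

```python
import re

# metric_name{label="value",...} value
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Parse one exposition line; return (name, labels, value) or None."""
    line = line.strip()
    # Skip blank lines and "# HELP" / "# TYPE" comment lines.
    if not line or line.startswith("#"):
        return None
    match = _SAMPLE_RE.match(line)
    if match is None:
        return None
    name, raw_labels, raw_value = match.groups()
    labels = dict(_LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


def parse_metrics(text: str) -> list:
    """Parse a whole scrape body into a list of (name, labels, value)."""
    samples = (parse_sample(line) for line in text.splitlines())
    return [sample for sample in samples if sample is not None]
```

For example, parse_sample('CPUUtilization{Level="Host",Hostname="h1"} 12.5') yields the metric name, the Level and Hostname labels as a dictionary, and the value as a float, which makes it straightforward to filter host-level from model-level samples.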

Scaling Based on Metrics

TorchServe's management API lets operators (or external automation) adjust the number of workers at runtime in response to observed metrics:

  • PUT /models/{model_name} with min_worker and max_worker parameters adjusts the worker pool
  • Monitoring QueueTime and GPUUtilization enables data-driven scaling decisions
  • For vLLM workloads, scaling is typically done at the model instance level (adding replicas) rather than adding workers, since vLLM handles internal concurrency
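The scaling call above can be sketched as follows. The helper name is hypothetical, and the base URL assumes the management API is listening on its usual local address (port 8081), which is not stated in this page; adjust it to match config.properties. The function only builds the request, so nothing is sent until it is passed to urllib.request.urlopen.

```python
import urllib.parse
import urllib.request


def build_scale_request(
    model_name: str,
    min_worker: int,
    max_worker: int,
    base_url: str = "http://127.0.0.1:8081",  # assumed management API address
) -> urllib.request.Request:
    """Build PUT /models/{model_name}?min_worker=..&max_worker=.. without sending it."""
    query = urllib.parse.urlencode({"min_worker": min_worker, "max_worker": max_worker})
    # Percent-encode the model name so it is safe as a single path segment.
    path = urllib.parse.quote(model_name, safe="")
    return urllib.request.Request(f"{base_url}/models/{path}?{query}", method="PUT")
```

Keeping request construction separate from sending makes the scaling decision easy to log or dry-run before it touches a production server.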

Usage

Metrics monitoring is a continuous operational concern throughout the lifetime of a deployed model. The typical integration workflow:

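  1. Confirm the metrics API is enabled in the TorchServe configuration (it is by default) and set metrics_mode to prometheus so the endpoint serves Prometheus text format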
  2. Configure Prometheus to scrape the :8082/metrics endpoint at regular intervals (e.g., 15 seconds)
  3. Build Grafana dashboards to visualize latency distributions, throughput, and resource utilization
  4. Set up alerts for critical thresholds (GPU memory > 90%, 5XX error rate > 1%, p99 latency > SLA)
  5. Use metrics for capacity planning -- correlate request volume with GPU utilization to right-size the deployment
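The alert thresholds in step 4 can be expressed as a small check. In practice these rules would normally live in the alerting system itself; the Python below is only a sketch of the logic, with a hypothetical function name and alert labels, and it assumes the caller has already computed the p99 latency and the request counts for the evaluation window.

```python
def evaluate_alerts(
    gpu_memory_pct: float,
    requests_5xx: int,
    requests_total: int,
    p99_latency_ms: float,
    sla_latency_ms: float,
) -> list:
    """Return the names of the example thresholds that are currently breached."""
    alerts = []
    # GPU memory above 90%.
    if gpu_memory_pct > 90.0:
        alerts.append("gpu_memory_high")
    # 5XX error rate above 1% of requests in the window.
    error_rate = requests_5xx / requests_total if requests_total > 0 else 0.0
    if error_rate > 0.01:
        alerts.append("error_rate_high")
    # p99 latency above the agreed SLA.
    if p99_latency_ms > sla_latency_ms:
        alerts.append("p99_latency_over_sla")
    return alerts
```

The thresholds are the example values from the list above and should be tuned per deployment.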

For LLM workloads specifically, key metrics to monitor include:

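  • GPUMemoryUtilization -- vLLM typically reserves GPU memory for its KV cache up front, so this value tends to stay high in steady state; the KV cache fills as concurrent sequences grow, and when it is exhausted vLLM preempts running requests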
  • HandlerTime -- end-to-end latency including token generation; long-tail distribution is expected for autoregressive models
  • ts_inference_requests_total -- throughput tracking for SLA monitoring

Theoretical Basis

Prometheus Pull Model

The pull-based metrics model (where a central server scrapes endpoints) has several advantages over push-based approaches for model serving:

  • Decoupled collection -- the model server does not need to know about the monitoring infrastructure
  • No backpressure -- if the monitoring system is slow, it simply skips a scrape interval without affecting the serving path
  • Service discovery -- Prometheus can automatically discover and scrape new model server instances

RED Method

TorchServe's metrics align with the RED method (Rate, Errors, Duration) for monitoring request-driven services:

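  • Rate -- ts_inference_requests_total is a cumulative counter; its change per unit time gives the request rate
  • Errors -- Requests4XX and Requests5XX count error responses; dividing their change by the change in total requests gives the error rate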
  • Duration -- ts_inference_latency_microseconds, HandlerTime, and PredictionTime measure latency
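Because the ts_inference counters are cumulative, rate and mean duration have to be derived from the difference between two scrapes. The sketch below shows that arithmetic; the function names are hypothetical, and treating a decreasing counter as a reset to zero is an assumption that mirrors how counter resets are commonly handled.

```python
def counter_delta(previous: float, current: float) -> float:
    """Increase of a cumulative counter between two scrapes."""
    # A counter that went down was reset (for example by a restart),
    # so the current value is the increase since the reset.
    return current if current < previous else current - previous


def request_rate(prev_count: float, curr_count: float, interval_s: float) -> float:
    """Requests per second from two ts_inference_requests_total samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return counter_delta(prev_count, curr_count) / interval_s


def mean_latency_ms(prev_latency_us, curr_latency_us, prev_count, curr_count):
    """Mean inference latency (ms) over the interval, or None if no requests."""
    requests = counter_delta(prev_count, curr_count)
    if requests == 0:
        return None
    # ts_inference_latency_microseconds accumulates microseconds; convert to ms.
    return counter_delta(prev_latency_us, curr_latency_us) / requests / 1000.0
```

Note that this yields a mean, not a percentile: the cumulative counters alone cannot produce a p99.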

USE Method for System Resources

TorchServe's system-level metrics map onto the USE method (Utilization, Saturation, Errors):

  • Utilization -- CPUUtilization, GPUUtilization, MemoryUtilization, DiskUtilization
  • Saturation -- QueueTime indicates when the system is saturated (requests queuing)
  • Errors -- Requests5XX captures system-level failures

The combination of RED and USE methods provides comprehensive observability across both the application layer (model inference) and the infrastructure layer (compute, memory, storage).
