Principle:Triton inference server Server Request Tracing
Overview
Request Tracing is the principle governing how Triton Inference Server instruments inference requests with detailed timing and activity information for performance analysis and distributed observability. The TraceManager class provides a dual-mode tracing system: a native Triton file-based trace mode that captures request lifecycle events to JSON files, and an OpenTelemetry mode that exports trace spans to external observability backends via the OTLP HTTP protocol. Tracing can be configured globally, per-model, and dynamically updated at runtime through the trace API.
Theoretical Basis
Why Tracing Matters for Inference Serving
Understanding where time is spent during inference is essential for optimization. A single inference request may traverse multiple stages: request queuing, scheduling, model execution, ensemble sub-model dispatching, memory allocation, and response serialization. Without tracing, operators have no visibility into which stage is the bottleneck. Furthermore, in microservice architectures where inference is one hop in a larger pipeline, distributed tracing enables end-to-end latency analysis across services.
Dual Trace Modes
| Mode | Export Target | Use Case |
|---|---|---|
| TRACE_MODE_TRITON | JSON file (configurable path) | Offline analysis, development profiling |
| TRACE_MODE_OPENTELEMETRY | OTLP HTTP endpoint (Jaeger, Zipkin, etc.) | Production distributed tracing, real-time observability |
The Triton-native mode records timestamped events into std::stringstream objects, grouped by trace ID, and periodically flushes them to indexed JSON files. The OpenTelemetry mode creates OTel spans with proper parent-child relationships, attributes (model name, version, trace ID), and timestamps, exporting them through the OpenTelemetry C++ SDK's batch span processor.
Hierarchical Trace Configuration
Trace settings follow a hierarchical model:
- Global default: Initial settings from server startup flags
- Global active: Current global settings, modifiable at runtime via the trace API
- Per-model: Model-specific overrides that inherit unspecified fields from the global setting
The TraceSetting class tracks which fields are explicitly specified (level_specified_, rate_specified_, etc.) versus inherited, enabling clean fallback behavior. When the global setting is updated, all per-model settings that inherit from it are automatically refreshed.
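The fallback behavior described above can be sketched in a few lines. This is a minimal illustration, not Triton's implementation: `TraceSettingSketch` and `TraceSettingsRegistry` are hypothetical names, `None` stands in for the `*_specified_` flags, and the effective per-model setting is resolved lazily on lookup instead of being refreshed eagerly when the global setting changes.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class TraceSettingSketch:
    """Hypothetical stand-in for TraceSetting; None means 'not specified'."""
    level: Optional[int] = None
    rate: Optional[int] = None
    count: Optional[int] = None
    log_frequency: Optional[int] = None

    def resolve(self, parent: "TraceSettingSketch") -> "TraceSettingSketch":
        """Return a copy where every unspecified field is taken from parent."""
        merged = {}
        for f in fields(self):
            own = getattr(self, f.name)
            merged[f.name] = own if own is not None else getattr(parent, f.name)
        return TraceSettingSketch(**merged)


class TraceSettingsRegistry:
    """Holds the three tiers: global default, global active, per-model."""

    def __init__(self, global_default: TraceSettingSketch):
        self.global_default = global_default
        self.global_active = global_default
        self.model_overrides = {}

    def set_model_override(self, model: str, override: TraceSettingSketch):
        self.model_overrides[model] = override

    def update_global(self, update: TraceSettingSketch):
        # Assumption: fields left unspecified in an update keep their
        # current active value.
        self.global_active = update.resolve(self.global_active)

    def effective(self, model: str) -> TraceSettingSketch:
        # Per-model fields win; everything else falls back to global active.
        override = self.model_overrides.get(model, TraceSettingSketch())
        return override.resolve(self.global_active)


registry = TraceSettingsRegistry(
    TraceSettingSketch(level=1, rate=1000, count=-1, log_frequency=0)
)
registry.set_model_override("resnet", TraceSettingSketch(rate=10))
before = registry.effective("resnet")
registry.update_global(TraceSettingSketch(level=3))
after = registry.effective("resnet")
```

In this sketch the model keeps its own `rate` across the global update while picking up the new global `level`, which is the inheritance behavior the explicit "specified" flags make possible.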
Sampling Control
Three parameters control trace sampling:
- level: A bitmask of TRITONSERVER_InferenceTraceLevel values controlling which activities to trace (timestamps, tensor data, etc.)
- rate: Sample one out of every N requests (e.g., rate=1000 means trace every 1000th request)
- count: Maximum number of traces to collect (-1 for unlimited, 0 to disable)
The SampleTrace() method atomically increments a counter and creates a trace object only when the counter aligns with the rate and the count limit has not been reached. In OpenTelemetry mode, a force_sample flag bypasses rate limiting when the incoming request already carries a propagated trace context from an upstream service.
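The sampling decision can be illustrated with a small counter-based sketch. `TraceSamplerSketch` is a hypothetical name, a lock stands in for the atomic increment, and whether a forced sample consumes the `count` budget is an assumption made here for simplicity, not something the description above specifies.

```python
import threading


class TraceSamplerSketch:
    """Hypothetical model of rate/count sampling; not Triton's code."""

    def __init__(self, rate: int, count: int):
        self.rate = rate
        self.count = count  # -1: unlimited, 0: tracing disabled
        self._seen = 0
        self._lock = threading.Lock()

    def sample(self, force_sample: bool = False) -> bool:
        """Return True when the current request should be traced."""
        with self._lock:
            if self.count == 0:
                return False
            self._seen += 1
            on_rate = self._seen % self.rate == 0
            if not (on_rate or force_sample):
                return False
            # Assumption: forced samples also consume the count budget.
            if self.count > 0:
                self.count -= 1
            return True


limited = TraceSamplerSketch(rate=3, count=2)
limited_results = [limited.sample() for _ in range(9)]

unlimited = TraceSamplerSketch(rate=2, count=-1)
unlimited_results = [unlimited.sample() for _ in range(6)]

forced = TraceSamplerSketch(rate=1000, count=-1).sample(force_sample=True)
```

With `rate=3` and `count=2`, only the third and sixth requests are traced; once the budget reaches zero, later requests are skipped even when they fall on the rate boundary.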
Trace Activity Lifecycle
The Triton core invokes registered callbacks at specific lifecycle points:
REQUEST_START -> QUEUE_START -> COMPUTE_START -> COMPUTE_INPUT_END
-> COMPUTE_OUTPUT_START -> COMPUTE_END -> REQUEST_END
Each activity callback receives a nanosecond-precision timestamp from std::chrono::steady_clock. In Triton mode, these timestamps are written as JSON. In OpenTelemetry mode, activity pairs (START/END) map to span start/end times, with intervening events recorded as span events.
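To show how the timestamps support bottleneck analysis, the following sketch computes the time spent between consecutive lifecycle activities. The record layout (`id`, `timestamps`, `name`, `ns`) and the sample values are illustrative assumptions, not a guaranteed description of Triton's trace file schema.

```python
LIFECYCLE = [
    "REQUEST_START",
    "QUEUE_START",
    "COMPUTE_START",
    "COMPUTE_INPUT_END",
    "COMPUTE_OUTPUT_START",
    "COMPUTE_END",
    "REQUEST_END",
]


def stage_durations(timestamps):
    """Return nanoseconds elapsed between consecutive lifecycle activities."""
    ns = {entry["name"]: entry["ns"] for entry in timestamps}
    durations = {}
    for start, end in zip(LIFECYCLE, LIFECYCLE[1:]):
        # Skip pairs that were not recorded (e.g. a lower trace level).
        if start in ns and end in ns:
            durations[f"{start}->{end}"] = ns[end] - ns[start]
    return durations


# Made-up timestamps for one request, in nanoseconds.
sample_ns = [100, 150, 900, 1000, 4000, 4200, 4300]
record = {
    "id": 1,
    "timestamps": [
        {"name": name, "ns": value} for name, value in zip(LIFECYCLE, sample_ns)
    ],
}

durations = stage_durations(record["timestamps"])
slowest_stage = max(durations, key=durations.get)
```

For these sample values the longest interval lies between COMPUTE_INPUT_END and COMPUTE_OUTPUT_START, pointing at model execution itself and not at queuing or tensor handling.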
OpenTelemetry Context Propagation
In OpenTelemetry mode, the server extracts trace context from incoming HTTP headers using the W3C Trace Context specification (traceparent, tracestate). The HttpTextMapCarrier class adapts evhtp's key-value headers to the OpenTelemetry TextMapCarrier interface for context extraction. When spawning child traces (e.g., for ensemble sub-models), the trace context is serialized and propagated via PrepareTraceContext().
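As an illustration of what context extraction involves, the sketch below parses a W3C `traceparent` header of the version `00` layout (`version-traceid-parentid-flags`). It does not use the OpenTelemetry SDK; `TraceContext` and `extract_trace_context` are hypothetical helpers written only for this example.

```python
import re
from typing import Mapping, NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def extract_trace_context(headers: Mapping[str, str]) -> Optional[TraceContext]:
    """Return the propagated context, or None if absent or malformed."""
    # HTTP header names are case-insensitive.
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get("traceparent")
    if value is None:
        return None
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


context = extract_trace_context(
    {"Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
)
missing = extract_trace_context({"content-type": "application/json"})
invalid = extract_trace_context(
    {"traceparent": "00-" + "0" * 32 + "-00f067aa0ba902b7-01"}
)
```

A request that yields a valid context is the case in which an upstream service has already started a trace, so the server has a parent span to attach its own spans to.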
Span Stack Architecture
In OpenTelemetry mode, the tracer keeps a stack of open spans for each trace ID. When a REQUEST_START or COMPUTE_START activity arrives, a new span is pushed onto that trace's stack. When the corresponding END activity arrives, the top span is popped and ended. This stack-based approach naturally handles nested spans from ensemble models and Business Logic Scripting (BLS), where a single request may spawn multiple child traces.
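The push and pop behavior can be sketched without the OpenTelemetry SDK. The classes and span names below (`SpanSketch`, `SpanStacksSketch`, "request", "compute") are hypothetical, and the sketch assumes activities arrive in a well-formed order.

```python
from collections import defaultdict

# Activities that open a span, mapped to an illustrative span name.
START_TO_SPAN = {"REQUEST_START": "request", "COMPUTE_START": "compute"}
END_ACTIVITIES = {"REQUEST_END", "COMPUTE_END"}


class SpanSketch:
    def __init__(self, name, start_ns, parent):
        self.name = name
        self.start_ns = start_ns
        self.parent = parent
        self.end_ns = None
        self.events = []


class SpanStacksSketch:
    """One stack of open spans per trace ID."""

    def __init__(self):
        self._stacks = defaultdict(list)
        self.finished = []

    def on_activity(self, trace_id, activity, ns):
        stack = self._stacks[trace_id]
        if activity in START_TO_SPAN:
            # The span currently on top becomes the parent of the new one.
            parent = stack[-1] if stack else None
            stack.append(SpanSketch(START_TO_SPAN[activity], ns, parent))
        elif activity in END_ACTIVITIES:
            span = stack.pop()
            span.end_ns = ns
            self.finished.append(span)
            if not stack:
                del self._stacks[trace_id]
        else:
            # Intervening activities become events on the innermost span.
            stack[-1].events.append((activity, ns))


tracer = SpanStacksSketch()
activities = [
    "REQUEST_START", "QUEUE_START", "COMPUTE_START", "COMPUTE_INPUT_END",
    "COMPUTE_OUTPUT_START", "COMPUTE_END", "REQUEST_END",
]
for ns, activity in enumerate(activities):
    tracer.on_activity(1, activity, ns)

compute_span, request_span = tracer.finished
```

Because the inner span is always on top of the stack, it ends before its parent, and its parent link is available at creation time without any lookup by name.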
Thread-Safe File Output
The TraceFile class manages concurrent writes to trace output files using a mutex. It supports two write modes: writing to a single file (for low-frequency tracing) and writing to indexed files (trace.0, trace.1, etc.) controlled by the log_frequency parameter. The sample_in_stream_ counter triggers a flush when the number of in-memory traces reaches the frequency threshold.
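A rough sketch of the buffering and indexed-file behavior follows. `TraceFileSketch` is a hypothetical class: the exact flush conditions, file naming, and on-disk format of the real TraceFile may differ, and writing each batch as a JSON array is an assumption.

```python
import json
import os
import tempfile
import threading


class TraceFileSketch:
    """Buffers traces and flushes them under a mutex (illustrative only)."""

    def __init__(self, path: str, log_frequency: int):
        self._path = path
        self._log_frequency = log_frequency  # 0: single file on close
        self._lock = threading.Lock()
        self._buffer = []
        self._index = 0

    def save(self, trace: dict):
        with self._lock:
            self._buffer.append(trace)
            if 0 < self._log_frequency <= len(self._buffer):
                self._flush_locked(f"{self._path}.{self._index}")
                self._index += 1

    def close(self):
        with self._lock:
            if not self._buffer:
                return
            if self._log_frequency == 0:
                self._flush_locked(self._path)
            else:
                self._flush_locked(f"{self._path}.{self._index}")

    def _flush_locked(self, target: str):
        # Caller must hold self._lock.
        with open(target, "w") as handle:
            json.dump(self._buffer, handle)
        self._buffer = []


with tempfile.TemporaryDirectory() as tmp:
    trace_file = TraceFileSketch(os.path.join(tmp, "trace"), log_frequency=2)
    for trace_id in range(5):
        trace_file.save({"id": trace_id})
    trace_file.close()
    written = sorted(os.listdir(tmp))
    with open(os.path.join(tmp, "trace.2")) as handle:
        last_batch = json.load(handle)
```

Five traces with a frequency of two produce three indexed files in this sketch, the last one holding the single leftover trace written at close.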
Runtime Trace Configuration Updates
The UpdateTraceSetting() API allows live modification of trace parameters without a server restart. It is exposed through the HTTP trace extension endpoints (/v2/trace/setting for global settings and /v2/models/<model_name>/trace/setting for per-model settings), so operators can enable tracing, adjust sampling rates, or change trace output files on a running production server.
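A runtime update might be issued as shown below. This is a sketch under stated assumptions: the server address, the `/v2/trace/setting` and `/v2/models/<model>/trace/setting` paths, and the field names and string-valued settings follow Triton's trace protocol extension as best understood here and should be checked against the server version in use. The helper `build_trace_update` is hypothetical, and the example only builds the request without sending it.

```python
import json
import urllib.request
from typing import Optional


def build_trace_update(
    base_url: str, settings: dict, model: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST request that updates trace settings."""
    if model is not None:
        path = f"/v2/models/{model}/trace/setting"  # assumed per-model path
    else:
        path = "/v2/trace/setting"  # assumed global path
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed field names; values are passed as strings in this sketch.
global_request = build_trace_update(
    "http://localhost:8000",
    {"trace_level": ["TIMESTAMPS"], "trace_rate": "100", "trace_count": "-1"},
)
model_request = build_trace_update(
    "http://localhost:8000", {"trace_rate": "1"}, model="resnet"
)
```

Passing either request to `urllib.request.urlopen` would send it; that step needs a running server and is left out here.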
Related Pages
Implementation:Triton_inference_server_Server_Tracer Triton_inference_server_Server