
Principle:Triton inference server Server Request Tracing

From Leeroopedia
Revision as of 17:57, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Triton_inference_server_Server_Request_Tracing.md)


Overview

Request Tracing is the principle governing how Triton Inference Server instruments inference requests with detailed timing and activity information for performance analysis and distributed observability. The TraceManager class provides a dual-mode tracing system: a native Triton file-based mode that captures request lifecycle events to JSON files, and an OpenTelemetry mode that exports trace spans to external observability backends over OTLP/HTTP. Tracing can be configured globally or per model, and both levels can be updated at runtime through the trace API.

Theoretical Basis

Why Tracing Matters for Inference Serving

Understanding where time is spent during inference is essential for optimization. A single inference request may traverse multiple stages: request queuing, scheduling, model execution, ensemble sub-model dispatching, memory allocation, and response serialization. Without tracing, operators have no visibility into which stage is the bottleneck. Furthermore, in microservice architectures where inference is one hop in a larger pipeline, distributed tracing enables end-to-end latency analysis across services.

Dual Trace Modes

Mode | Export Target | Use Case
TRACE_MODE_TRITON | JSON file (configurable path) | Offline analysis, development profiling
TRACE_MODE_OPENTELEMETRY | OTLP HTTP endpoint (Jaeger, Zipkin, etc.) | Production distributed tracing, real-time observability

The Triton-native mode records timestamped events into std::stringstream objects, grouped by trace ID, and periodically flushes them to indexed JSON files. The OpenTelemetry mode creates OTel spans with proper parent-child relationships, attributes (model name, version, trace ID), and timestamps, exporting them through the OpenTelemetry C++ SDK's batch span processor.

Hierarchical Trace Configuration

Trace settings follow a hierarchical model:

  1. Global default: Initial settings from server startup flags
  2. Global active: Current global settings, modifiable at runtime via the trace API
  3. Per-model: Model-specific overrides that inherit unspecified fields from the global setting

The TraceSetting class tracks which fields are explicitly specified (level_specified_, rate_specified_, etc.) versus inherited, enabling clean fallback behavior. When the global setting is updated, all per-model settings that inherit from it are automatically refreshed.
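
The fallback behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not Triton's C++ TraceSetting: the class shape, the resolve() helper, and the use of None to mean "not specified" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceSetting:
    """Hypothetical stand-in: a field left as None is 'not specified'."""
    level: Optional[int] = None
    rate: Optional[int] = None
    count: Optional[int] = None

    def resolve(self, parent: "TraceSetting") -> "TraceSetting":
        """Return a copy where every unspecified field comes from the parent."""
        return TraceSetting(
            level=self.level if self.level is not None else parent.level,
            rate=self.rate if self.rate is not None else parent.rate,
            count=self.count if self.count is not None else parent.count,
        )


global_setting = TraceSetting(level=1, rate=1000, count=-1)
model_override = TraceSetting(rate=10)  # only the rate is model-specific

effective = model_override.resolve(global_setting)

# After a runtime update to the global setting, re-resolving the override
# picks up the new inherited value while keeping the model-specific rate.
global_setting.count = 500
refreshed = model_override.resolve(global_setting)
```

Keeping the override separate from the resolved result is what makes the refresh cheap: the per-model record never stores inherited values, so nothing has to be undone when the global setting changes.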

Sampling Control

Three parameters control trace sampling:

  • level: A bitmask of TRITONSERVER_InferenceTraceLevel values controlling which activities to trace (timestamps, tensor data, etc.)
  • rate: Sample one out of every N requests (e.g., rate=1000 means trace every 1000th request)
  • count: Maximum number of traces to collect (-1 for unlimited, 0 to disable)

The SampleTrace() method atomically increments a counter and creates a trace object only when the counter aligns with the rate and the count limit has not been reached. In OpenTelemetry mode, a force_sample flag bypasses rate limiting when the incoming request already carries a propagated trace context from an upstream service.
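
A minimal sketch of this sampling decision follows. It is a simplified model rather than the actual SampleTrace() code: the Sampler class is hypothetical, a lock stands in for the atomic counter, the choice of which request in each window is sampled is an assumption, and so is the choice to let force_sample bypass only the rate check and not the count limit.

```python
import threading


class Sampler:
    """Hypothetical model of rate/count sampling (not Triton's class)."""

    def __init__(self, rate: int, count: int):
        self._rate = rate
        self._remaining = count  # -1 means unlimited, 0 means disabled
        self._seen = 0
        self._lock = threading.Lock()  # stands in for an atomic counter

    def sample(self, force_sample: bool = False) -> bool:
        with self._lock:
            self._seen += 1
            if self._remaining == 0:
                return False  # count limit reached, or tracing disabled
            if not force_sample and self._seen % self._rate != 0:
                return False  # this request does not align with the rate
            if self._remaining > 0:
                self._remaining -= 1
            return True


sampler = Sampler(rate=3, count=2)
decisions = [sampler.sample() for _ in range(10)]

# An upstream trace context forces sampling regardless of the rate.
forced = Sampler(rate=1000, count=-1).sample(force_sample=True)
```

With rate=3 and count=2, the third and sixth requests are traced and every later request is skipped because the count budget is exhausted.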

Trace Activity Lifecycle

The Triton core invokes registered callbacks at specific lifecycle points:

REQUEST_START -> QUEUE_START -> COMPUTE_START -> COMPUTE_INPUT_END
    -> COMPUTE_OUTPUT_START -> COMPUTE_END -> REQUEST_END

Each activity callback receives a nanosecond-precision timestamp from std::chrono::steady_clock. In Triton mode, these timestamps are written as JSON. In OpenTelemetry mode, activity pairs (START/END) map to span start/end times, with intervening events recorded as span events.
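
To show how these timestamps answer the "where is time spent" question, the sketch below derives per-stage durations from one request's activities. The timestamp values are invented for illustration, and the stage names (queue, compute_input, and so on) are labels chosen for this example, not fields of Triton's trace output.

```python
# Illustrative nanosecond timestamps for a single request (made-up values).
timestamps_ns = {
    "REQUEST_START": 1_000,
    "QUEUE_START": 1_200,
    "COMPUTE_START": 1_900,
    "COMPUTE_INPUT_END": 2_400,
    "COMPUTE_OUTPUT_START": 5_400,
    "COMPUTE_END": 5_900,
    "REQUEST_END": 6_300,
}


def stage_durations(ts: dict) -> dict:
    """Turn lifecycle timestamps into per-stage durations in nanoseconds."""
    return {
        "queue": ts["COMPUTE_START"] - ts["QUEUE_START"],
        "compute_input": ts["COMPUTE_INPUT_END"] - ts["COMPUTE_START"],
        "compute_infer": ts["COMPUTE_OUTPUT_START"] - ts["COMPUTE_INPUT_END"],
        "compute_output": ts["COMPUTE_END"] - ts["COMPUTE_OUTPUT_START"],
        "total": ts["REQUEST_END"] - ts["REQUEST_START"],
    }


durations = stage_durations(timestamps_ns)
```

Because the timestamps come from a monotonic clock, only differences between them are meaningful; they should not be interpreted as wall-clock times.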

OpenTelemetry Context Propagation

In OpenTelemetry mode, the server extracts trace context from incoming HTTP headers using the W3C Trace Context specification (traceparent, tracestate). The HttpTextMapCarrier class adapts evhtp's key-value headers to the OpenTelemetry TextMapCarrier interface for context extraction. When spawning child traces (e.g., for ensemble sub-models), the trace context is serialized and propagated via PrepareTraceContext().
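
For readers unfamiliar with the header format, here is a small sketch that parses a version 00 traceparent value as defined by the W3C Trace Context specification. It only illustrates the wire format; Triton itself delegates extraction to the OpenTelemetry C++ SDK, and parse_traceparent is a hypothetical helper.

```python
import re
from typing import Optional

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(headers: dict) -> Optional[dict]:
    """Extract trace context from HTTP headers, or None if absent/invalid."""
    # HTTP header names are case-insensitive.
    value = next(
        (v for k, v in headers.items() if k.lower() == "traceparent"), None
    )
    if value is None:
        return None
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


context = parse_traceparent(
    {"Traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}
)
```

A request that yields a valid context here is the kind of request for which the force_sample flag described under Sampling Control would apply.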

Span Stack Architecture

In OpenTelemetry mode, the tracer keeps a stack of open spans for each trace, keyed by trace ID. When a REQUEST_START or COMPUTE_START activity arrives, a new span is pushed onto that trace's stack. When the corresponding END activity arrives, the top span is popped and ended. This stack-based approach naturally handles nested spans from ensemble models and Business Logic Scripting (BLS), where a single request may spawn multiple child traces.
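
The push/pop behavior can be sketched for a single trace as follows. This is a simplified model, not the OpenTelemetry SDK or Triton's tracer: the SpanStack class and its dictionary-based spans are hypothetical, and attaching non-span activities to the innermost open span is an assumption based on the lifecycle description above.

```python
# Activities that open or close a span; everything else is a span event.
OPENERS = {"REQUEST_START": "REQUEST", "COMPUTE_START": "COMPUTE"}
CLOSERS = {"REQUEST_END", "COMPUTE_END"}


class SpanStack:
    """Hypothetical per-trace stack of open spans."""

    def __init__(self):
        self._open = []      # innermost open span is last
        self.finished = []   # spans in the order they ended

    def on_activity(self, activity: str, timestamp_ns: int) -> None:
        if activity in OPENERS:
            parent = self._open[-1]["name"] if self._open else None
            self._open.append({
                "name": OPENERS[activity],
                "parent": parent,
                "start": timestamp_ns,
                "events": [],
            })
        elif activity in CLOSERS:
            span = self._open.pop()
            span["end"] = timestamp_ns
            self.finished.append(span)
        elif self._open:
            # Intervening activity: record it on the innermost open span.
            self._open[-1]["events"].append((activity, timestamp_ns))

    def depth(self) -> int:
        return len(self._open)


stack = SpanStack()
for ts, activity in enumerate([
    "REQUEST_START", "QUEUE_START", "COMPUTE_START", "COMPUTE_INPUT_END",
    "COMPUTE_OUTPUT_START", "COMPUTE_END", "REQUEST_END",
]):
    stack.on_activity(activity, ts)
```

After the loop, the COMPUTE span has ended first with REQUEST as its parent, and the stack is empty again, which is the invariant a well-formed request should leave behind.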

Thread-Safe File Output

The TraceFile class manages concurrent writes to trace output files using a mutex. It supports two write modes: writing to a single file (for low-frequency tracing) and writing to indexed files (trace.0, trace.1, etc.) controlled by the log_frequency parameter. The sample_in_stream_ counter triggers a flush when the number of in-memory traces reaches the frequency threshold.
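
The following sketch illustrates the mutex-plus-threshold pattern with indexed output files. It is not the TraceFile implementation: IndexedTraceWriter is a hypothetical class, treating log_frequency=0 as "write one file on close" is an assumption, and so is writing any leftover traces to the next index at shutdown.

```python
import json
import os
import tempfile
import threading


class IndexedTraceWriter:
    """Hypothetical writer: buffer traces, flush every log_frequency traces."""

    def __init__(self, base_path: str, log_frequency: int):
        self._base_path = base_path
        self._log_frequency = log_frequency
        self._lock = threading.Lock()
        self._buffer = []
        self._index = 0

    def save(self, trace: dict) -> None:
        with self._lock:  # serialize concurrent callers
            self._buffer.append(trace)
            if 0 < self._log_frequency <= len(self._buffer):
                self._flush_locked(f"{self._base_path}.{self._index}")
                self._index += 1

    def close(self) -> None:
        with self._lock:
            if not self._buffer:
                return
            if self._log_frequency == 0:
                self._flush_locked(self._base_path)  # single-file mode
            else:
                self._flush_locked(f"{self._base_path}.{self._index}")

    def _flush_locked(self, path: str) -> None:
        # Caller must hold the lock.
        with open(path, "w") as f:
            json.dump(self._buffer, f)
        self._buffer = []


with tempfile.TemporaryDirectory() as out_dir:
    writer = IndexedTraceWriter(os.path.join(out_dir, "trace"), log_frequency=2)
    threads = [
        threading.Thread(target=writer.save, args=({"id": i},))
        for i in range(5)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    writer.close()
    written = sorted(os.listdir(out_dir))
```

Five traces with a frequency of two produce trace.0 and trace.1 during the run and trace.2 for the remainder at close, regardless of the order in which the threads are scheduled.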

Runtime Trace Configuration Updates

The UpdateTraceSetting() API allows live modification of trace parameters without a server restart. It is exposed over HTTP through the trace setting endpoints (/v2/trace/setting for the global setting and /v2/models/<model>/trace/setting for a per-model override), enabling operators to enable tracing, adjust sampling rates, or change trace output files in production.
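
As a hedged illustration of such an update, the sketch below builds (but does not send) an HTTP request using only the Python standard library. The endpoint paths and the trace_level, trace_rate, and trace_count field names follow Triton's trace protocol extension as best understood here and should be checked against the server version in use; the build_trace_update helper, the localhost URL, and the model name are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_trace_update(
    base_url: str, settings: dict, model: Optional[str] = None
) -> urllib.request.Request:
    """Build a POST request that updates global or per-model trace settings."""
    if model:
        path = f"/v2/models/{model}/trace/setting"
    else:
        path = "/v2/trace/setting"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed payload: trace timestamps for one in every 100 requests, no limit.
request = build_trace_update(
    "http://localhost:8000",
    {"trace_level": ["TIMESTAMPS"], "trace_rate": "100", "trace_count": "-1"},
)
model_request = build_trace_update(
    "http://localhost:8000", {"trace_rate": "10"}, model="example_model"
)

# With a server running, urllib.request.urlopen(request) would send it.
```

Building the request separately from sending it keeps the example runnable without a live server and makes the target URL and payload easy to inspect before applying a change in production.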

Related Pages

Implementation:Triton_inference_server_Server_Tracer Triton_inference_server_Server
