Principle:Triton inference server Server Shared Memory Management

From Leeroopedia


Overview

Shared Memory Management is the principle governing how Triton Inference Server exchanges inference input and output data with co-located clients through shared memory: POSIX shared memory for CPU data and CUDA IPC memory for GPU data. The SharedMemoryManager class provides a thread-safe registry of named shared memory regions that clients register before inference. Requests then refer to a region by name, so tensor data passes from the client's address space to the server's inference pipeline without being serialized into the request body or sent over the network.

Theoretical Basis

Why Shared Memory for Inference

In high-throughput inference deployments, serializing tensor data into HTTP/gRPC messages, transmitting it through the network stack (even over loopback), and deserializing it on the server side can become a significant bottleneck. For co-located clients (same host or same container), POSIX shared memory and CUDA IPC memory let the client and the server access the same buffer, so the tensor payload is never copied into a request message. This is particularly impactful for:

  • Large tensor inputs: Image batches, video frames, or high-dimensional embeddings that may be megabytes per request.
  • GPU-resident data: When the client application already has data on GPU (e.g., a preprocessing pipeline), CUDA IPC allows the server to read directly from the client's GPU memory without a device-to-host-to-device round trip.
  • Latency-sensitive applications: Real-time inference where every microsecond of data transfer matters.

Dual Memory Type Support

The manager supports two fundamentally different shared memory mechanisms:

Memory Type  | Registration API                                             | Underlying Mechanism
System (CPU) | RegisterSystemSharedMemory(name, shm_key, offset, byte_size) | POSIX shm_open() + mmap()
CUDA (GPU)   | RegisterCUDASharedMemory(name, handle, byte_size, device_id) | cudaIpcOpenMemHandle()

System shared memory uses POSIX shared memory objects identified by a string key; on Linux these objects appear as files under /dev/shm. The manager opens the shared memory file descriptor, maps the object into the server's address space, and records the mapping together with the requested offset and byte size.
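The create, attach, and read-at-offset flow described above can be sketched with Python's standard multiprocessing.shared_memory module, which is backed by POSIX shared memory on Linux. This is a minimal illustration, not Triton's code: the function name write_then_read is hypothetical, and both the "client" and the "server" side run in one process for brevity.

```python
import struct
from multiprocessing import shared_memory


def write_then_read(offset: int, values: list) -> list:
    """Round-trip float32 values through a shared region at a byte offset."""
    byte_size = 4 * len(values)  # float32 is 4 bytes per element
    fmt = f"<{len(values)}f"

    # "Client" side: create the shared memory object and write the tensor bytes.
    client = shared_memory.SharedMemory(create=True, size=offset + byte_size)
    try:
        struct.pack_into(fmt, client.buf, offset, *values)

        # "Server" side: attach to the same object by its key (name).
        server = shared_memory.SharedMemory(name=client.name)
        try:
            # Slicing a memoryview does not copy the underlying bytes.
            view = server.buf[offset:offset + byte_size]
            result = list(struct.unpack(fmt, view))
            view.release()  # must be released before the mapping is closed
            return result
        finally:
            server.close()
    finally:
        client.close()
        client.unlink()  # remove the object once nobody needs it
```

In a real deployment the two sides live in different processes, and the client passes only the key, offset, and byte size to the server.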

CUDA shared memory uses cudaIpcMemHandle_t handles that the client obtains from cudaIpcGetMemHandle() and passes to the server. The server opens the handle with cudaIpcOpenMemHandle() to obtain a device pointer valid in its own CUDA context.

SharedMemoryInfo Registry

Each registered region is tracked through a SharedMemoryInfo struct containing:

  • name_: Unique identifier for the region
  • shm_key_: POSIX shared memory object name (for system memory)
  • offset_: Byte offset within the shared memory object
  • byte_size_: Size of the registered region
  • shm_fd_: File descriptor (for system memory)
  • mapped_addr_: Pointer to the mapped memory
  • kind_: TRITONSERVER_MEMORY_CPU or TRITONSERVER_MEMORY_GPU
  • device_id_: GPU device ID (for CUDA memory)

CUDA regions additionally store the cudaIpcMemHandle_t in a CUDASharedMemoryInfo subclass, enabling the HTTP/gRPC server to include the IPC handle in responses when outputs are placed in CUDA shared memory.
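As a rough model of the record layout listed above, the following Python dataclasses mirror the fields one for one. The real types are C++ structs; the Python names, the MemoryKind enum, and the cuda_ipc_handle field name are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MemoryKind(Enum):
    # Values echo the Triton enum names mentioned in the text.
    CPU = "TRITONSERVER_MEMORY_CPU"
    GPU = "TRITONSERVER_MEMORY_GPU"


@dataclass
class SharedMemoryInfo:
    name: str                   # unique identifier for the region
    shm_key: str                # POSIX object name (system memory only)
    offset: int                 # byte offset within the object
    byte_size: int              # size of the registered region
    shm_fd: int                 # file descriptor (system memory only)
    mapped_addr: Optional[int]  # address of the mapping, if any
    kind: MemoryKind
    device_id: int              # GPU device ID (CUDA memory only)


@dataclass
class CUDASharedMemoryInfo(SharedMemoryInfo):
    # Opaque IPC handle bytes (hypothetical field name).
    cuda_ipc_handle: bytes = b""
```

Because the CUDA record is a subclass, code that only needs the common fields can treat both kinds of region uniformly.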

Thread-Safe Access

All operations on the shared memory map are protected by a mutex (mu_), ensuring correctness when multiple inference threads concurrently register, query, or unregister shared memory regions. The GetMemoryInfo() method additionally returns a std::shared_ptr<const SharedMemoryInfo> that increments a reference count, preventing a region from being unregistered while an in-flight inference request is still reading from it. The awaiting_unregister_ flag allows deferred cleanup: if an unregister request arrives while references are held, the actual cleanup occurs when the last reference is released.
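The interaction between the mutex, the reference count, and the deferred-unregister flag can be modeled in a few lines of Python. This is a sketch of the behavior described above, not the actual implementation: an explicit counter stands in for std::shared_ptr's use count, every class and method name is hypothetical, and refusing new acquisitions while a region awaits unregistration is an assumption of this sketch.

```python
import threading


class Region:
    def __init__(self, name: str, byte_size: int):
        self.name = name
        self.byte_size = byte_size
        self.ref_count = 0                # in-flight users of this region
        self.awaiting_unregister = False
        self.released = False             # set once cleanup has run


class Registry:
    def __init__(self):
        self._mu = threading.Lock()       # guards the map and every Region
        self._regions = {}

    def register(self, name: str, byte_size: int) -> None:
        with self._mu:
            if name in self._regions:
                raise ValueError(f"region '{name}' is already registered")
            self._regions[name] = Region(name, byte_size)

    def acquire(self, name: str) -> Region:
        with self._mu:
            region = self._regions[name]
            if region.awaiting_unregister:
                # Assumption: no new users once unregistration is pending.
                raise KeyError(f"region '{name}' is being unregistered")
            region.ref_count += 1
            return region

    def release(self, region: Region) -> None:
        with self._mu:
            region.ref_count -= 1
            if region.ref_count == 0 and region.awaiting_unregister:
                self._cleanup(region)     # deferred cleanup happens here

    def unregister(self, name: str) -> bool:
        """Return True if cleaned up now, False if deferred."""
        with self._mu:
            region = self._regions[name]
            if region.ref_count > 0:
                region.awaiting_unregister = True
                return False
            self._cleanup(region)
            return True

    def is_registered(self, name: str) -> bool:
        with self._mu:
            return name in self._regions

    def _cleanup(self, region: Region) -> None:
        # Caller must hold self._mu.
        del self._regions[region.name]
        region.released = True
```

A request handler would call acquire() before reading a region and release() when the response has been sent, so an unregister that arrives in between only sets the flag.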

Bounds Checking

The GetMemoryInfo() method validates that the requested offset + byte_size does not exceed the registered region's bounds. This prevents out-of-bounds memory access that could cause crashes or security vulnerabilities.
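A minimal sketch of such a check is shown below. The function name check_bounds and its error messages are hypothetical; the comparison is written in the subtraction form that also stays correct with fixed-width unsigned integers, where offset + byte_size could wrap around.

```python
def check_bounds(region_byte_size: int, offset: int, byte_size: int) -> tuple:
    """Return the (start, end) byte range if it fits inside the region."""
    if offset < 0 or byte_size < 0:
        raise ValueError("offset and byte_size must be non-negative")
    # Equivalent to offset + byte_size > region_byte_size, but it cannot
    # overflow when ported to unsigned fixed-width arithmetic.
    if offset > region_byte_size or byte_size > region_byte_size - offset:
        raise ValueError(
            f"range [{offset}, {offset + byte_size}) exceeds "
            f"region size {region_byte_size}"
        )
    return offset, offset + byte_size
```

For example, a request for 768 bytes at offset 256 fits exactly in a 1024-byte region, while one more byte is rejected.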

Status Reporting

The GetStatus() method serializes the state of all registered regions (or a specific named region) as JSON, enabling clients to verify their shared memory registrations through the HTTP/gRPC API. Status includes the region name and byte size, together with the key and offset for system regions and the device ID for CUDA regions.
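To make the shape of such a status report concrete, here is a small Python sketch that serializes either every region or one named region. The JSON field names and the get_status signature are assumptions for illustration and may differ from what the server actually emits.

```python
import json
from typing import Optional


def get_status(regions: dict, name: Optional[str] = None) -> str:
    """Serialize all regions, or only the named one, as a JSON array."""
    if name is None:
        selected = list(regions.values())
    elif name in regions:
        selected = [regions[name]]
    else:
        raise KeyError(f"no shared memory region named '{name}'")
    return json.dumps(selected)


# Hypothetical registry contents: one system region and one CUDA region.
REGIONS = {
    "input0_shm": {
        "name": "input0_shm", "key": "/input0", "offset": 0, "byte_size": 4096,
    },
    "output0_cuda": {
        "name": "output0_cuda", "device_id": 0, "byte_size": 1024,
    },
}
```

Calling get_status(REGIONS, "input0_shm") yields a one-element array, which a client can compare against the values it registered.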

Unregistration and Cleanup

The Unregister() and UnregisterAll() methods close file descriptors, unmap memory, and (for CUDA) close IPC memory handles. The destructor ensures all regions are cleaned up when the manager is destroyed, preventing resource leaks.
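The cleanup order for a system region (unmap, then close the descriptor) can be sketched as follows. A temporary file stands in for the POSIX shared memory object because Python's standard library has no public shm_open; the class names and the demo helper are hypothetical, and CUDA handle cleanup is not modeled.

```python
import mmap
import os
import tempfile


class MappedRegion:
    def __init__(self, path: str, byte_size: int):
        self.fd = os.open(path, os.O_RDWR)
        try:
            self.mapping = mmap.mmap(self.fd, byte_size)
        except Exception:
            os.close(self.fd)  # do not leak the descriptor on failure
            raise

    def close(self) -> None:
        if not self.mapping.closed:
            self.mapping.close()  # unmap the memory
        if self.fd != -1:
            os.close(self.fd)     # close the file descriptor
            self.fd = -1


class RegionManager:
    def __init__(self):
        self._regions = {}

    def register(self, name: str, path: str, byte_size: int) -> MappedRegion:
        if name in self._regions:
            raise ValueError(f"region '{name}' is already registered")
        self._regions[name] = MappedRegion(path, byte_size)
        return self._regions[name]

    def unregister(self, name: str) -> None:
        self._regions.pop(name).close()

    def unregister_all(self) -> None:
        for name in list(self._regions):
            self.unregister(name)

    def __len__(self) -> int:
        return len(self._regions)


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "region.bin")
        with open(path, "wb") as f:
            f.write(b"\0" * 4096)  # backing storage for both mappings
        manager = RegionManager()
        a = manager.register("a", path, 4096)
        b = manager.register("b", path, 1024)
        manager.unregister("a")
        manager.unregister_all()
        return len(manager), a.mapping.closed, b.mapping.closed, a.fd, b.fd
```

Making close() idempotent means a destructor-style final sweep can call unregister_all() safely even after individual regions were already removed.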

Related Pages

Implementation:Triton_inference_server_Server_SharedMemoryManager Triton_inference_server_Server
