Principle:Triton inference server Server Shared Memory Management
Overview
Shared Memory Management is the principle governing how Triton Inference Server enables zero-copy inter-process communication (IPC) for inference input and output data through POSIX shared memory (CPU) and CUDA IPC shared memory (GPU) regions. The SharedMemoryManager class provides a thread-safe registry of named shared memory regions that clients can pre-register before inference, allowing tensor data to be passed directly from client address space to the server's inference pipeline without serialization or network transfer overhead.
Theoretical Basis
In high-throughput inference deployments, the cost of serializing tensor data into HTTP/gRPC messages, transmitting over the network stack (even loopback), and deserializing on the server side can become a significant bottleneck. For co-located clients (same host or same container), POSIX shared memory and CUDA IPC memory provide a mechanism to share data at memory-bus speeds with zero copies. This is particularly impactful for:
- Large tensor inputs: Image batches, video frames, or high-dimensional embeddings that may be megabytes per request.
- GPU-resident data: When the client application already has data on GPU (e.g., a preprocessing pipeline), CUDA IPC allows the server to read directly from the client's GPU memory without a device-to-host-to-device round trip.
- Latency-sensitive applications: Real-time inference where every microsecond of data transfer matters.
Dual Memory Type Support
The manager supports two fundamentally different shared memory mechanisms:
| Memory Type | Registration API | Underlying Mechanism |
|---|---|---|
| System (CPU) | RegisterSystemSharedMemory(name, shm_key, offset, byte_size) | POSIX shm_open() + mmap() |
| CUDA (GPU) | RegisterCUDASharedMemory(name, handle, byte_size, device_id) | cudaIpcOpenMemHandle() |
System shared memory uses POSIX shared memory objects identified by a string key; on Linux these objects appear as files under /dev/shm. The manager opens the object to obtain a file descriptor, maps it into the server's address space at the specified offset, and records the mapping.
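The open-and-map sequence described above can be sketched with plain POSIX calls. This is a minimal illustration, not Triton's implementation: MappedRegion, CreateRegion, MapRegion, and UnmapRegion are hypothetical names, and mapping from file offset 0 and then adding the offset to the pointer is an assumption made here because mmap() requires page-aligned file offsets.

```cpp
#include <fcntl.h>     // O_CREAT, O_RDWR
#include <sys/mman.h>  // shm_open, shm_unlink, mmap, munmap
#include <sys/stat.h>  // fstat, mode constants
#include <unistd.h>    // ftruncate, close

#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical record of one mapping (not Triton's SharedMemoryInfo).
struct MappedRegion {
  int fd = -1;
  void* base = nullptr;    // start of the mmap'd range
  size_t mapped_size = 0;  // offset + byte_size
  char* data = nullptr;    // base + offset: first byte of the region
};

// Client side: create a shared memory object of `size` bytes.
bool CreateRegion(const std::string& shm_key, size_t size)
{
  int fd = shm_open(shm_key.c_str(), O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
  if (fd == -1) return false;
  bool ok = (ftruncate(fd, static_cast<off_t>(size)) == 0);
  close(fd);
  return ok;
}

// Server side: open an existing object and map [0, offset + byte_size).
bool MapRegion(
    const std::string& shm_key, size_t offset, size_t byte_size,
    MappedRegion* out)
{
  int fd = shm_open(shm_key.c_str(), O_RDWR, S_IRUSR | S_IWUSR);
  if (fd == -1) return false;

  // Refuse ranges past the end of the object; touching them would SIGBUS.
  const size_t total = offset + byte_size;
  struct stat st;
  if (fstat(fd, &st) != 0 || static_cast<size_t>(st.st_size) < total) {
    close(fd);
    return false;
  }

  void* base = mmap(nullptr, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) {
    close(fd);
    return false;
  }
  out->fd = fd;
  out->base = base;
  out->mapped_size = total;
  out->data = static_cast<char*>(base) + offset;
  return true;
}

// Release the mapping and the descriptor, then reset the record.
void UnmapRegion(MappedRegion* region)
{
  if (region->base != nullptr) munmap(region->base, region->mapped_size);
  if (region->fd != -1) close(region->fd);
  *region = MappedRegion{};
}
```

In this sketch a client would call CreateRegion("/my_key", 4096) and write its tensor bytes, after which the server side calls MapRegion("/my_key", offset, byte_size, &region) and reads from region.data. Both mappings refer to the same physical pages, so a write through one is visible through the other.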
CUDA shared memory uses cudaIpcMemHandle_t handles that the client obtains from cudaIpcGetMemHandle() and passes to the server. The server opens the handle with cudaIpcOpenMemHandle() to obtain a device pointer valid in its own CUDA context.
Each registered region is tracked through a SharedMemoryInfo struct containing:
- name_: Unique identifier for the region
- shm_key_: POSIX shared memory object name (for system memory)
- offset_: Byte offset within the shared memory object
- byte_size_: Size of the registered region
- shm_fd_: File descriptor (for system memory)
- mapped_addr_: Pointer to the mapped memory
- kind_: TRITONSERVER_MEMORY_CPU or TRITONSERVER_MEMORY_GPU
- device_id_: GPU device ID (for CUDA memory)
CUDA regions additionally store the cudaIpcMemHandle_t in a CUDASharedMemoryInfo subclass, enabling the HTTP/gRPC server to include the IPC handle in responses when outputs are placed in CUDA shared memory.
Thread-Safe Access
All operations on the shared memory map are protected by a mutex (mu_), ensuring correctness when multiple inference threads concurrently register, query, or unregister shared memory regions. The GetMemoryInfo() method additionally returns a std::shared_ptr<const SharedMemoryInfo> that increments a reference count, preventing a region from being unregistered while an in-flight inference request is still reading from it. The awaiting_unregister_ flag allows deferred cleanup: if an unregister request arrives while references are held, the actual cleanup occurs when the last reference is released.
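The interaction between the mutex, the reference count, and deferred cleanup can be illustrated with a small registry. This is a sketch under stated assumptions rather than Triton's code: Region and RegionRegistry are hypothetical names, and instead of an explicit awaiting_unregister_ flag it relies on the std::shared_ptr destructor to run cleanup when the last holder lets go.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a registered region. The destructor runs the
// cleanup callback, which is where unmapping or closing a handle would go.
struct Region {
  std::string name;
  size_t byte_size;
  std::function<void()> on_release;

  Region(std::string n, size_t size, std::function<void()> cb)
      : name(std::move(n)), byte_size(size), on_release(std::move(cb))
  {
  }
  ~Region()
  {
    if (on_release) on_release();
  }
};

class RegionRegistry {
 public:
  // Returns false if the name is already registered.
  bool Register(
      const std::string& name, size_t byte_size,
      std::function<void()> on_release)
  {
    std::lock_guard<std::mutex> lock(mu_);
    if (regions_.count(name) != 0) return false;
    regions_[name] =
        std::make_shared<Region>(name, byte_size, std::move(on_release));
    return true;
  }

  // Hands out a counted reference; the region stays alive while it is held.
  std::shared_ptr<const Region> Get(const std::string& name)
  {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = regions_.find(name);
    if (it == regions_.end()) return nullptr;
    return it->second;
  }

  // Removes the name from the map. Cleanup is deferred until the last
  // outstanding reference is released.
  bool Unregister(const std::string& name)
  {
    std::shared_ptr<Region> removed;
    {
      std::lock_guard<std::mutex> lock(mu_);
      auto it = regions_.find(name);
      if (it == regions_.end()) return false;
      removed = std::move(it->second);
      regions_.erase(it);
    }
    return true;  // `removed` is dropped here, after the lock is released
  }

 private:
  std::mutex mu_;
  std::map<std::string, std::shared_ptr<Region>> regions_;
};
```

With this design, a request that called Get() before Unregister() keeps a valid region until it drops its pointer, while later lookups of the same name already fail. Dropping the registry's own reference outside the lock keeps a potentially slow cleanup callback from blocking other threads.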
Bounds Checking
The GetMemoryInfo() method validates that the requested offset + byte_size does not exceed the registered region's bounds. This prevents out-of-bounds memory access that could cause crashes or security vulnerabilities.
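One detail worth showing is that a naive `offset + byte_size <= region_size` test can wrap around for very large unsigned values and wrongly pass. The following sketch uses a hypothetical helper named InBounds (not part of Triton's API) and rearranges the comparison so that no addition is needed.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when [offset, offset + byte_size) lies entirely inside a
// region of `region_size` bytes. The check is split into two comparisons so
// that offset + byte_size is never computed and cannot overflow.
bool InBounds(uint64_t region_size, uint64_t offset, uint64_t byte_size)
{
  if (offset > region_size) return false;
  return byte_size <= region_size - offset;
}
```

For a 1024-byte region, InBounds(1024, 1000, 24) accepts a request ending exactly at the region's last byte, while InBounds(1024, 1000, 25) and InBounds(1024, 16, UINT64_MAX) are both rejected.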
Status Reporting
The GetStatus() method serializes the state of all registered regions (or a specific named region) as JSON, enabling clients to verify their shared memory registrations through the HTTP/gRPC API. Status includes the region name, key, offset, byte size, and device ID.
Unregistration and Cleanup
The Unregister() and UnregisterAll() methods close file descriptors, unmap memory, and (for CUDA) close IPC memory handles. The destructor ensures all regions are cleaned up when the manager is destroyed, preventing resource leaks.
Related Pages
- Implementation:Triton_inference_server_Server_SharedMemoryManager
- Triton_inference_server_Server