Environment:Kserve Kserve SRIOV RDMA Network
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, High_Performance_Networking |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
SR-IOV and RDMA over Converged Ethernet (RoCE) networking stack for high-bandwidth KV cache transfer in disaggregated LLM inference.
Description
Multi-node LLM inference with prefill-decode separation requires low-latency, high-bandwidth interconnects for transferring KV caches between the prefill and decode pools. The SR-IOV device plugin exposes RDMA-capable virtual functions (VFs) to pods, and RoCE v2 provides the transport layer. NCCL and NVSHMEM environment variables are derived automatically from the HCAs that appear under `/sys/class/infiniband` (RoCE devices are listed there as well, despite the name).
Usage
Use this environment for multi-node disaggregated LLM inference with prefill-decode separation (DP-EP patterns) where KV cache must be transferred between pods on different nodes via RDMA.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | Mellanox ConnectX NICs | With SR-IOV and RoCE v2 support |
| Kubernetes | >= 1.24 | With Multus CNI plugin |
| SR-IOV Device Plugin | Latest | Exposes VFs to pods |
| SR-IOV Network Operator | Latest | Manages SriovNetworkNodePolicy |
| RDMA Protocol | RoCE v2 | RDMA over Converged Ethernet |
Dependencies
Kubernetes Operators
- SR-IOV Network Operator
- Multus CNI (for multiple network interfaces)
Pod Resources
- `rdma/roce_gdr` resource type (1 per pod)
- Network attachment definitions for RDMA networks
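As an illustration of how a pod might consume these resources, the following sketch renders a minimal Pod manifest that requests one `rdma/roce_gdr` VF and attaches a secondary network through the Multus `k8s.v1.cni.cncf.io/networks` annotation. The helper name `render_rdma_pod`, the container name `worker`, and all three arguments are hypothetical placeholders rather than values from the KServe samples.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal Pod manifest that requests one RDMA VF.
# $1 = pod name, $2 = NetworkAttachmentDefinition name, $3 = container image
# (all three are caller-supplied placeholders, not values taken from KServe).
render_rdma_pod() {
  pod_name=$1
  network_name=$2
  image=$3
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${pod_name}
  annotations:
    # Multus reads this annotation to attach the secondary (RDMA) interface.
    k8s.v1.cni.cncf.io/networks: ${network_name}
spec:
  containers:
    - name: worker
      image: ${image}
      resources:
        # Extended resources must have identical requests and limits.
        requests:
          rdma/roce_gdr: "1"
        limits:
          rdma/roce_gdr: "1"
EOF
}
```

One way to check the output without creating anything is to pipe it into `kubectl apply --dry-run=client -f -`.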
Credentials
No additional credentials required.
Quick Install
# Apply SR-IOV network node policy and network attachment
kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/network-roce.yaml
Code Evidence
RDMA network configuration from `docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/network-roce.yaml`:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: roce-gdr-policy
spec:
  deviceType: netdevice
  isRdma: true
  resourceName: roce_gdr
RoCE auto-detection from `config/llmisvcconfig/config-llm-worker-data-parallel.yaml`:
# Detect active HCAs
# (excerpt: port_state_file and type_file are assumed to be set in lines omitted here)
for hca_dir in /sys/class/infiniband/mlx5_*; do
  if grep -q "ACTIVE" "$port_state_file" && grep -q "RoCE v2" "$type_file"; then
    : # Add to active HCAs
  fi
done
# For SR-IOV, prefer GID_INDEX=3
if [ -n "${gid_index_count['3']}" ] && [ "${gid_index_count['3']}" -eq "$max_count" ]; then
  best_gid_index="3"
fi
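The selection rule in the excerpt (take the most common GID index, but prefer 3 when it ties for the maximum) can be sketched as a small standalone function. This is a simplified illustration, not the script shipped in the config file: the name `pick_gid_index`, the one-index-per-line input format, and the lowest-index tie-break for the non-3 case are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: read one candidate GID index per line on stdin and
# print the index seen most often. Index 3 wins whenever its count equals the
# maximum; other ties fall back to the lowest index (an assumption made here
# so the result is deterministic).
pick_gid_index() {
  awk '
    /^[0-9]+$/ { count[$1]++ }
    END {
      max = 0
      for (idx in count) {
        if (count[idx] > max || (count[idx] == max && (best == "" || idx + 0 < best + 0))) {
          max = count[idx]
          best = idx
        }
      }
      # SR-IOV preference: choose 3 if it is tied for the highest count.
      if (("3" in count) && count["3"] == max) best = "3"
      if (best != "") print best
    }'
}
```

For example, `printf '3\n1\n3\n1\n' | pick_gid_index` prints `3`, and the result could then be exported as `NCCL_IB_GID_INDEX`.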
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `rdma/roce_gdr` resource not available | SR-IOV device plugin not installed | Deploy SR-IOV Network Operator and apply node policy |
| NCCL timeout on multi-node | Incorrect GID index | Verify that `NCCL_IB_GID_INDEX` points to a RoCE v2 GID entry on the active port |
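To investigate the GID index error, it can help to list which GID entries on a port are actually of type RoCE v2. The sketch below reads the standard sysfs layout `ports/<port>/gid_attrs/types/<index>`; the function names and the overridable sysfs root argument are hypothetical conveniences introduced here, not part of KServe.

```shell
#!/bin/sh
# Hypothetical helper: list GID indices whose type is "RoCE v2" for one HCA port.
# $1 = HCA name (e.g. mlx5_0), $2 = port number (default 1),
# $3 = sysfs root (default /sys/class/infiniband; overridable for testing).
list_rocev2_gid_indices() {
  hca=$1
  port=${2:-1}
  root=${3:-/sys/class/infiniband}
  for type_file in "$root/$hca/ports/$port/gid_attrs/types"/*; do
    [ -f "$type_file" ] || continue
    # Unpopulated GID slots can fail to read; ignore those errors.
    if grep -q "RoCE v2" "$type_file" 2>/dev/null; then
      basename "$type_file"
    fi
  done | sort -n
}

# Succeeds only if the given index is a RoCE v2 entry on that port.
# $1 = GID index, $2 = HCA name, $3 = port (default 1), $4 = sysfs root.
gid_index_is_rocev2() {
  list_rocev2_gid_indices "$2" "${3:-1}" "${4:-/sys/class/infiniband}" | grep -qx "$1"
}
```

Inside a worker pod, something like `gid_index_is_rocev2 "$NCCL_IB_GID_INDEX" mlx5_0 || echo "GID index is not RoCE v2"` would flag a mismatch, assuming `mlx5_0` is the HCA in use.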
Compatibility Notes
- SR-IOV GID index: Default preferred value is 3 for SR-IOV environments
- NVSHMEM: Uses same GID index as NCCL for consistency
- UCX transport: Auto-configured from detected InfiniBand HCAs
- Pod anti-affinity: Scheduling prefill and decode pods on different nodes can force KV transfer over RDMA instead of NVLink