Environment:Kserve Kserve SRIOV RDMA Network

From Leeroopedia
Knowledge Sources
Domains Infrastructure, High_Performance_Networking
Last Updated 2026-02-13 14:00 GMT

Overview

SR-IOV and RDMA over Converged Ethernet (RoCE) networking stack for high-bandwidth KV cache transfer in disaggregated LLM inference.

Description

Multi-node LLM inference with prefill-decode separation needs a low-latency, high-bandwidth interconnect to transfer KV caches between the prefill and decode pools. The SR-IOV device plugin exposes RDMA-capable virtual functions (VFs) to pods, and RoCE v2 provides the transport. NCCL and NVSHMEM environment variables are set automatically from the detected InfiniBand HCA configuration.

Usage

Use this environment for multi-node disaggregated LLM inference with prefill-decode separation (DP-EP patterns) where KV cache must be transferred between pods on different nodes via RDMA.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Hardware | Mellanox ConnectX NICs | With SR-IOV and RoCE v2 support |
| Kubernetes | >= 1.24 | With Multus CNI plugin |
| SR-IOV Device Plugin | Latest | Exposes VFs to pods |
| SR-IOV Network Operator | Latest | Manages SriovNetworkNodePolicy |
| RDMA Protocol | RoCE v2 | RDMA over Converged Ethernet |

Dependencies

Kubernetes Operators

  • SR-IOV Network Operator
  • Multus CNI (for multiple network interfaces)

Pod Resources

  • `rdma/roce_gdr` resource type (1 per pod)
  • Network attachment definitions for RDMA networks
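As an illustration of how a pod might consume these two dependencies, the sketch below renders a bare Pod manifest that requests one `rdma/roce_gdr` device and attaches a secondary network through the Multus `k8s.v1.cni.cncf.io/networks` annotation. The function name `render_rdma_pod`, the container name, and the example arguments are hypothetical; the actual KServe samples configure this through their own templates rather than a bare Pod.

```shell
# Hypothetical helper: print a minimal Pod manifest that requests one RDMA VF.
# Arguments (all illustrative): pod name, NetworkAttachmentDefinition name, image.
render_rdma_pod() (
    pod_name="$1"
    nad_name="$2"
    image="$3"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${pod_name}
  annotations:
    # Multus attaches the secondary (RDMA) interface named here
    k8s.v1.cni.cncf.io/networks: ${nad_name}
spec:
  containers:
    - name: worker
      image: ${image}
      resources:
        # Extended resources must have equal requests and limits
        requests:
          rdma/roce_gdr: "1"
        limits:
          rdma/roce_gdr: "1"
EOF
)
```

The output can be inspected first and then piped to `kubectl apply -f -`, for example `render_rdma_pod demo my-roce-net my-image:tag`, where all three values are placeholders.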

Credentials

No additional credentials required.

Quick Install

# Apply SR-IOV network node policy and network attachment
kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/network-roce.yaml

Code Evidence

RDMA network configuration from `docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/network-roce.yaml`:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: roce-gdr-policy
spec:
  deviceType: netdevice
  isRdma: true
  resourceName: roce_gdr

RoCE auto-detection from `config/llmisvcconfig/config-llm-worker-data-parallel.yaml`:

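# Detect active HCAs (port 1 and the sysfs file paths are assumed for this excerpt)
active_hcas=""
for hca_dir in /sys/class/infiniband/mlx5_*; do
    port_state_file="$hca_dir/ports/1/state"
    type_files="$hca_dir/ports/1/gid_attrs/types/*"
    # Keep the HCA if its port is ACTIVE and at least one GID entry is RoCE v2
    if grep -q "ACTIVE" "$port_state_file" 2>/dev/null && grep -q "RoCE v2" $type_files 2>/dev/null; then
        active_hcas="${active_hcas:+$active_hcas,}$(basename "$hca_dir")"
    fi
done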

# For SR-IOV, prefer GID_INDEX=3
if [ -n "${gid_index_count['3']}" ] && [ "${gid_index_count['3']}" -eq "$max_count" ]; then
    best_gid_index="3"
fi
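
The selection logic excerpted above can be sketched end to end as follows. This is a POSIX shell illustration of the same counting idea, not the code from the config file: the function name `pick_gid_index`, the `sysfs_root` parameter, the use of port 1, and the lowest-index tie-break among indexes other than 3 are all assumptions made for the example.

```shell
# Hypothetical helper: print the GID index that most HCAs expose as RoCE v2,
# preferring index 3 when it is tied for most common. Runs in a subshell so
# its variables do not leak into the caller.
pick_gid_index() (
    sysfs_root="${1:-/sys/class/infiniband}"

    # One output line per (HCA, index) pair whose GID type is RoCE v2.
    # Unpopulated entries fail to read on real sysfs; those errors are ignored.
    indexes=$(
        for type_file in "$sysfs_root"/mlx5_*/ports/1/gid_attrs/types/*; do
            if grep -q "RoCE v2" "$type_file" 2>/dev/null; then
                basename "$type_file"
            fi
        done
    )
    [ -n "$indexes" ] || exit 1   # no RoCE v2 GID found anywhere

    # "count index" lines: highest count first, then lowest index first
    ranked=$(printf '%s\n' "$indexes" | sort -n | uniq -c | sort -k1,1nr -k2,2n)
    max_count=$(printf '%s\n' "$ranked" | awk 'NR == 1 { print $1 }')
    best_gid_index=$(printf '%s\n' "$ranked" | awk 'NR == 1 { print $2 }')

    # For SR-IOV, prefer index 3 when it is among the most common
    count_3=$(printf '%s\n' "$ranked" | awk '$2 == "3" { print $1 }')
    if [ -n "$count_3" ] && [ "$count_3" -eq "$max_count" ]; then
        best_gid_index="3"
    fi
    printf '%s\n' "$best_gid_index"
)
```

Taking the sysfs root as a parameter lets the function be exercised against a temporary directory tree instead of real hardware. On a node, `pick_gid_index` with no arguments reads `/sys/class/infiniband`.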

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `rdma/roce_gdr` resource not available | SR-IOV device plugin not installed | Deploy SR-IOV Network Operator and apply node policy |
| NCCL timeout in multi-node runs | Incorrect GID index | Verify that `NCCL_IB_GID_INDEX` points to a RoCE v2 GID entry on an active port |
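
For the GID index error, one way to check a candidate index from inside a pod is to read the kernel's sysfs entries directly. The sketch below is a hypothetical helper (`check_gid_index` and its parameters are not part of KServe); it assumes the standard `/sys/class/infiniband/<hca>/ports/<port>/state` and `gid_attrs/types/<index>` files and defaults to port 1.

```shell
# Hypothetical helper: verify that a GID index on an HCA port is a RoCE v2
# entry and that the port is ACTIVE. Prints one OK/FAIL line and sets the
# exit status accordingly.
check_gid_index() (
    hca="$1"
    gid_index="$2"
    sysfs_root="${3:-/sys/class/infiniband}"
    port="${4:-1}"

    state=$(cat "$sysfs_root/$hca/ports/$port/state" 2>/dev/null)
    gid_type=$(cat "$sysfs_root/$hca/ports/$port/gid_attrs/types/$gid_index" 2>/dev/null)

    # The state file typically reads like "4: ACTIVE"
    case "$state" in
        *ACTIVE*) ;;
        *)
            printf 'FAIL: %s port %s is not ACTIVE (state: %s)\n' "$hca" "$port" "${state:-unknown}"
            exit 1
            ;;
    esac

    if [ "$gid_type" != "RoCE v2" ]; then
        printf 'FAIL: %s port %s GID index %s is "%s", not RoCE v2\n' \
            "$hca" "$port" "$gid_index" "${gid_type:-unreadable}"
        exit 1
    fi
    printf 'OK: %s port %s GID index %s is RoCE v2\n' "$hca" "$port" "$gid_index"
)
```

A typical call would be `check_gid_index mlx5_0 "$NCCL_IB_GID_INDEX"`, where `mlx5_0` stands in for whichever HCA the pod actually uses.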

Compatibility Notes

  • SR-IOV GID index: Index 3 is preferred in SR-IOV environments when it is among the most common RoCE v2 indexes across the detected HCAs
  • NVSHMEM: Uses same GID index as NCCL for consistency
  • UCX transport: Auto-configured from detected InfiniBand HCAs
  • Pod anti-affinity: Scheduling prefill and decode pods on different nodes can force KV transfer over RDMA instead of NVLink
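
To make the relationship between the detected HCAs, the GID index, and the library settings concrete, the sketch below prints `export` lines for NCCL, NVSHMEM, and UCX from one HCA list and one GID index. The function name `rdma_env` is hypothetical, and the exact set of variables the KServe config exports is an assumption here; `NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, `NVSHMEM_IB_GID_INDEX`, and `UCX_NET_DEVICES` are existing settings of those libraries.

```shell
# Hypothetical helper: derive NCCL, NVSHMEM and UCX settings from a
# comma-separated HCA list and a single GID index (port defaults to 1).
rdma_env() (
    hcas="$1"          # e.g. "mlx5_0,mlx5_1"
    gid_index="$2"     # e.g. "3"
    port="${3:-1}"

    # UCX expects device:port pairs, e.g. "mlx5_0:1,mlx5_1:1"
    ucx_devices=$(printf '%s\n' "$hcas" | sed "s/,/:$port,/g; s/\$/:$port/")

    printf 'export NCCL_IB_HCA=%s\n' "$hcas"
    printf 'export NCCL_IB_GID_INDEX=%s\n' "$gid_index"
    # NVSHMEM reuses the NCCL GID index for consistency
    printf 'export NVSHMEM_IB_GID_INDEX=%s\n' "$gid_index"
    printf 'export UCX_NET_DEVICES=%s\n' "$ucx_devices"
)
```

The printed lines can be reviewed or applied in the current shell with `eval "$(rdma_env mlx5_0,mlx5_1 3)"`.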
