Overview
This file defines the pod template configuration for LLM decode workers with data-parallel execution and NIXL v2 connector support for high-performance GPU communication.
Description
Specified as an LLMInferenceServiceConfig (v1alpha2), this configuration includes an llm-d-routing-sidecar init container using the NIXL v2 connector for KV cache transfer, and a main llm-d-cuda container that runs vLLM with an embedded bash startup script. The startup script auto-detects RoCE (RDMA over Converged Ethernet) HCA devices, discovers active mlx5 interfaces, determines the optimal GID index for SR-IOV environments, and configures NCCL/NVSHMEM/UCX InfiniBand settings before launching multi-GPU data-parallel inference via vllm serve.
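The detection step described above can be sketched in bash roughly as follows. This is an illustrative sketch, not the script embedded in the configuration. The function names `detect_active_hcas` and `to_ucx_devices` are hypothetical. It assumes the standard Linux sysfs layout (`/sys/class/infiniband/<dev>/ports/1/state` and `link_layer`), checks port 1 only, and treats an Ethernet link layer as the marker of a RoCE device.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: find mlx5 HCAs whose port 1 is ACTIVE with an
# Ethernet (RoCE) link layer. The sysfs root is a parameter for testing.
detect_active_hcas() {
  local root="${1:-/sys/class/infiniband}"
  local dev state layer
  local -a active=()
  for dev in "$root"/mlx5_*; do
    [ -d "$dev" ] || continue
    state="$(cat "$dev/ports/1/state" 2>/dev/null)" || continue
    layer="$(cat "$dev/ports/1/link_layer" 2>/dev/null)" || continue
    # The state file reads like "4: ACTIVE" when the port is up.
    case "$state" in *ACTIVE*) ;; *) continue ;; esac
    [ "$layer" = "Ethernet" ] || continue
    active+=("${dev##*/}")
  done
  local IFS=,
  echo "${active[*]}"   # comma-separated, e.g. mlx5_0,mlx5_2
}

# Convert "mlx5_0,mlx5_2" into the HCA:port form "mlx5_0:1,mlx5_2:1".
to_ucx_devices() {
  echo "$1" | sed 's/,/:1,/g; s/$/:1/'
}

hcas="$(detect_active_hcas)"
if [ -n "$hcas" ]; then
  export NCCL_IB_HCA="$hcas"
  export NVSHMEM_HCA_LIST="$hcas"
  export UCX_NET_DEVICES="$(to_ucx_devices "$hcas")"
fi
```

On a node without RDMA devices the function prints an empty string and no variables are exported, so the libraries keep their defaults.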
Usage
Use this configuration as the base template for decode workers in a disaggregated prefill-decode LLM serving architecture. It is referenced by LLMInferenceService resources that require dedicated decode pools with data-parallel execution and RDMA networking.
Code Reference
Source Location
Signature
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-decode-worker-data-parallel
spec:
  template:
    initContainers:
      - name: llm-d-routing-sidecar
        image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
        restartPolicy: Always
        ports:
          - containerPort: 8000
        args:
          - "--port=8000"
          - "--vllm-port=8001"
          - "--connector=nixlv2"
          - "--secure-proxy=false"
    containers:
      - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
        name: main
        ports:
          - containerPort: 8001
        command:
          - "/bin/bash"
          - "-c"
        args:
          - |-
            # Auto-detect RoCE HCAs, configure NCCL/NVSHMEM/UCX, launch vllm serve
Import
kubectl apply -f config/llmisvcconfig/config-llm-decode-worker-data-parallel.yaml
I/O Contract
Init Container: llm-d-routing-sidecar
| Parameter | Value | Description |
|---|---|---|
| --port | 8000 | External-facing proxy port |
| --vllm-port | 8001 | Internal vLLM engine port |
| --connector | nixlv2 | KV cache transfer connector type |
| --secure-proxy | false | TLS disabled (BackendTLSPolicy not yet implemented) |
Main Container: llm-d-cuda
| Component | Description |
|---|---|
| Image | ghcr.io/llm-d/llm-d-cuda:v0.4.0 |
| Port | 8001 (TCP) |
| Startup Script | Auto-detects RoCE HCAs, configures NCCL_IB_HCA, NVSHMEM_HCA_LIST, UCX_NET_DEVICES, GID indices |
| vLLM Launch | vllm serve with data-parallel, tensor-parallel, and expert-parallel flags |
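As a rough illustration of the vLLM Launch row, the sketch below assembles a `vllm serve` command line with the three kinds of parallelism flags. It is not the command from the actual startup script. The function `build_vllm_args`, its positional parameters, and the model path `/models/example-model` are hypothetical. The flags `--data-parallel-size`, `--tensor-parallel-size`, and `--enable-expert-parallel` are standard vLLM options, but the values shown are placeholders. The function only prints the command so it can be inspected before use.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build (but do not run) a vllm serve command line.
# Arguments: model path, data-parallel size, tensor-parallel size,
# and "true" to enable expert parallelism.
build_vllm_args() {
  local model="$1" dp="${2:-1}" tp="${3:-1}" ep="${4:-false}"
  local -a args=(serve "$model" --port 8001
                 --data-parallel-size "$dp"
                 --tensor-parallel-size "$tp")
  if [ "$ep" = "true" ]; then
    args+=(--enable-expert-parallel)
  fi
  echo "vllm ${args[*]}"
}

# Placeholder values for illustration only.
build_vllm_args /models/example-model 4 2 true
```

Port 8001 matches the `--vllm-port` value that the routing sidecar forwards to.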
Environment Variables Configured by Startup Script
| Variable | Description |
|---|---|
| NCCL_IB_HCA | Comma-separated list of active HCA device names |
| NVSHMEM_HCA_LIST | HCA list for NVSHMEM library |
| UCX_NET_DEVICES | UCX network devices (HCA:port format) |
| NCCL_IB_GID_INDEX | InfiniBand GID index for RoCE v2 |
| NVSHMEM_IB_GID_INDEX | GID index for NVSHMEM |
| UCX_IB_GID_INDEX | GID index for UCX |
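One way the GID index variables could be derived is sketched below. The selection rule is an assumption, not taken from the actual startup script: it prefers the first RoCE v2 entry with an IPv4-mapped GID and otherwise falls back to the lowest-numbered non-zero RoCE v2 entry. The function name `find_rocev2_gid_index` and the device name `mlx5_0` are hypothetical. The sysfs paths (`gid_attrs/types/<idx>` and `gids/<idx>`) follow the standard Linux InfiniBand layout.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: choose a RoCE v2 GID index for one HCA port.
# Arguments: device name, port (default 1), sysfs root (for testing).
find_rocev2_gid_index() {
  local dev="$1" port="${2:-1}" root="${3:-/sys/class/infiniband}"
  local base="$root/$dev/ports/$port"
  local idx type gid fallback=""
  # Walk the indices in numeric order so "first" means lowest index.
  for idx in $(ls "$base/gid_attrs/types" 2>/dev/null | sort -n); do
    # Unpopulated entries fail to read on real sysfs; skip them.
    type="$(cat "$base/gid_attrs/types/$idx" 2>/dev/null)" || continue
    [ "$type" = "RoCE v2" ] || continue
    gid="$(cat "$base/gids/$idx" 2>/dev/null)" || continue
    # Skip all-zero (unused) GID table entries.
    [ "$gid" = "0000:0000:0000:0000:0000:0000:0000:0000" ] && continue
    case "$gid" in
      # IPv4-mapped GID: assumed here to be the preferred choice.
      0000:0000:0000:0000:0000:ffff:*) echo "$idx"; return 0 ;;
    esac
    [ -n "$fallback" ] || fallback="$idx"
  done
  [ -n "$fallback" ] && echo "$fallback"
}

# Example device name; export only when an index was found.
if idx="$(find_rocev2_gid_index mlx5_0)"; then
  export NCCL_IB_GID_INDEX="$idx"
  export NVSHMEM_IB_GID_INDEX="$idx"
  export UCX_IB_GID_INDEX="$idx"
fi
```

The function returns a non-zero status when no RoCE v2 entry exists, so the three variables stay unset and each library falls back to its own default.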
Volume Mounts
| Mount Path | Volume | Description |
|---|---|---|
| /home | home (emptyDir) | Writable home directory |
| /dev/shm | dshm (emptyDir, Memory) | Shared memory for inter-process communication (1Gi) |
| /models | model-cache (emptyDir) | HuggingFace model cache |
| /etc/ssl/certs | tls-certs (secret) | TLS certificates (read-only) |
Usage Examples
# Apply the decode worker configuration
kubectl apply -f config/llmisvcconfig/config-llm-decode-worker-data-parallel.yaml
# Verify the config is created
kubectl get llminferenceserviceconfig kserve-config-llm-decode-worker-data-parallel