Overview
This file defines the pod template configuration for standard (non-disaggregated) LLM workers with data-parallel execution and automatic RoCE/InfiniBand network discovery.
Description
Specified as an LLMInferenceServiceConfig (v1alpha2), this configuration provides a unified worker template in which a single pod handles both the prefill and decode phases, in contrast to the disaggregated prefill/decode approach. The main container (image llm-d-cuda) runs the same RoCE auto-detection Bash startup script used by the disaggregated worker configs. The script discovers active mlx5 HCA devices, determines optimal GID indices for SR-IOV environments, and configures the NCCL, NVSHMEM, and UCX InfiniBand settings before launching multi-GPU data-parallel inference.
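The HCA discovery step described above can be sketched as follows. This page does not reproduce the actual startup script, so this is only a minimal illustration under stated assumptions: the helper names `list_active_hcas` and `export_hca_env` are hypothetical, the sysfs root is made overridable purely for testing, and passing plain device names to NVSHMEM_HCA_LIST is an assumption about the format the script uses.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the script shipped in the llm-d-cuda image.
# list_active_hcas and export_hca_env are hypothetical helper names.

# Print "device:port" for every mlx5 port whose sysfs state is ACTIVE.
list_active_hcas() {
  local root="${1:-/sys/class/infiniband}"
  local dev_path state_file dev port
  for dev_path in "$root"/mlx5_*; do
    [ -d "$dev_path" ] || continue
    dev="$(basename "$dev_path")"
    for state_file in "$dev_path"/ports/*/state; do
      [ -r "$state_file" ] || continue
      # The kernel reports the port state as e.g. "4: ACTIVE".
      if grep -q "ACTIVE" "$state_file"; then
        port="$(basename "$(dirname "$state_file")")"
        printf '%s:%s\n' "$dev" "$port"
      fi
    done
  done
  return 0
}

# Export the HCA-related variables from the discovered ports.
export_hca_env() {
  local root="${1:-/sys/class/infiniband}"
  local pairs devices
  pairs="$(list_active_hcas "$root")"
  # Nothing active: leave the environment untouched.
  [ -n "$pairs" ] || return 1
  devices="$(printf '%s\n' "$pairs" | cut -d: -f1 | sort -u | paste -sd, -)"
  export NCCL_IB_HCA="$devices"
  export NVSHMEM_HCA_LIST="$devices"   # assumption: plain device names
  export UCX_NET_DEVICES="$(printf '%s\n' "$pairs" | paste -sd, -)"
}
```

Calling `export_hca_env` with no argument reads the real `/sys/class/infiniband` tree. On a host without mlx5 devices it returns non-zero and exports nothing, which lets a caller fall back to the libraries' default device selection.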
Usage
Use this configuration as the base template for LLM workers in a unified (non-disaggregated) serving architecture. This is the simpler deployment pattern where each worker handles the full inference pipeline without separating prefill and decode phases.
Code Reference
Source Location
Signature
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-worker-data-parallel
spec:
  template:
    containers:
      - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
        imagePullPolicy: IfNotPresent
        name: main
        ports:
          - containerPort: 8000
            protocol: TCP
        command:
          - "/bin/bash"
          - "-c"
        args:
          - |-
            # Auto-detect RoCE HCAs, configure NCCL/NVSHMEM/UCX, launch vllm serve
Import
kubectl apply -f config/llmisvcconfig/config-llm-worker-data-parallel.yaml
I/O Contract
Main Container: llm-d-cuda
| Component | Description |
|---|---|
| Image | ghcr.io/llm-d/llm-d-cuda:v0.4.0 |
| Port | 8000 (TCP) |
| Startup Script | Auto-detects RoCE HCAs, configures NCCL_IB_HCA, NVSHMEM_HCA_LIST, UCX_NET_DEVICES, GID indices |
| vLLM Launch | vllm serve with data-parallel, tensor-parallel, and expert-parallel flags |
Comparison with Disaggregated Configs
| Aspect | Standard Worker | Decode Worker | Prefill Worker |
|---|---|---|---|
| Config Name | kserve-config-llm-worker-data-parallel | kserve-config-llm-decode-worker-data-parallel | kserve-config-llm-prefill-worker-data-parallel |
| Config Section | spec.template | spec.template | spec.prefill.template |
| Routing Sidecar | Not included | Included (NIXL v2) | Not included |
| Role | Full inference pipeline | Token generation only | Prompt processing only |
Environment Variables Configured by Startup Script
| Variable | Description |
|---|---|
| NCCL_IB_HCA | Comma-separated list of active HCA device names |
| NVSHMEM_HCA_LIST | HCA list for NVSHMEM library |
| UCX_NET_DEVICES | UCX network devices (HCA:port format) |
| NCCL_IB_GID_INDEX | InfiniBand GID index for RoCE v2 |
| NVSHMEM_IB_GID_INDEX | GID index for NVSHMEM |
| UCX_IB_GID_INDEX | GID index for UCX |
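As an illustration of how the three GID index variables listed above might be derived, the sketch below scans one port's GID table in sysfs and picks the lowest populated RoCE v2 entry. The selection rule, the helper names `find_rocev2_gid_index` and `export_gid_index`, and the choice to apply one index to all three libraries are assumptions made for this example. The logic the shipped script uses in SR-IOV environments is not shown on this page and may differ.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; helper names and the selection rule are assumptions.

# Print the lowest GID index on a port whose type is "RoCE v2" and whose GID is non-zero.
find_rocev2_gid_index() {
  local port_dir="$1"   # e.g. /sys/class/infiniband/mlx5_0/ports/1
  local zero_gid="0000:0000:0000:0000:0000:0000:0000:0000"
  local idx=0 gid_type gid
  while [ "$idx" -lt 256 ]; do
    # Unpopulated entries are missing or unreadable; treat them as empty.
    gid_type="$(cat "$port_dir/gid_attrs/types/$idx" 2>/dev/null)" || gid_type=""
    gid="$(cat "$port_dir/gids/$idx" 2>/dev/null)" || gid=""
    if [ "$gid_type" = "RoCE v2" ] && [ -n "$gid" ] && [ "$gid" != "$zero_gid" ]; then
      printf '%s\n' "$idx"
      return 0
    fi
    idx=$((idx + 1))
  done
  return 1
}

# Export the same index for NCCL, NVSHMEM, and UCX.
export_gid_index() {
  local idx
  idx="$(find_rocev2_gid_index "$1")" || return 1
  export NCCL_IB_GID_INDEX="$idx"
  export NVSHMEM_IB_GID_INDEX="$idx"
  export UCX_IB_GID_INDEX="$idx"
}
```

For example, `export_gid_index /sys/class/infiniband/mlx5_0/ports/1` would set all three variables if that port exposes a RoCE v2 GID, and return non-zero otherwise.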
vLLM Serve Command Template
| Flag | Source | Description |
|---|---|---|
| --served-model-name | .Spec.Model.Name | Model name from the LLMInferenceService spec |
| --port | hardcoded | 8000 |
| --data-parallel-size | .Spec.Parallelism.Data | Number of data-parallel ranks (default: 1) |
| --data-parallel-size-local | .Spec.Parallelism.DataLocal | Local data-parallel ranks (default: 1) |
| --tensor-parallel-size | .Spec.Parallelism.Tensor | Tensor parallelism degree |
| --enable-expert-parallel | .Spec.Parallelism.Expert | Enable MoE expert parallelism |
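To make the flag mapping above concrete, the following sketch assembles the flag list from plain shell arguments. It is only an approximation of what the rendered template might produce: the helper name `build_vllm_flags` is hypothetical, the values that the real config takes from the LLMInferenceService spec are passed here as positional arguments, and the tensor-parallel default of 1 is an assumption (the table gives no default for it).

```shell
#!/usr/bin/env bash
# Illustrative sketch only; build_vllm_flags is a hypothetical helper name.

# Print one vllm serve flag or value per line.
# Arguments: model name, data-parallel size, local data-parallel size,
#            tensor-parallel size, expert-parallel toggle ("true" to enable).
build_vllm_flags() {
  local model_name="$1"
  local dp="${2:-1}"         # default from the table above
  local dp_local="${3:-1}"   # default from the table above
  local tp="${4:-1}"         # assumption: no default is documented
  local expert="${5:-false}"
  set -- \
    --served-model-name "$model_name" \
    --port 8000 \
    --data-parallel-size "$dp" \
    --data-parallel-size-local "$dp_local" \
    --tensor-parallel-size "$tp"
  if [ "$expert" = "true" ]; then
    set -- "$@" --enable-expert-parallel
  fi
  printf '%s\n' "$@"
}
```

Running `build_vllm_flags my-model 4 2 2 true` prints the six flags with their values, one token per line. A caller would append these to `vllm serve` together with the model location, which this page does not specify.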
Usage Examples
# Apply the standard worker configuration
kubectl apply -f config/llmisvcconfig/config-llm-worker-data-parallel.yaml
# Verify the config is created
kubectl get llminferenceserviceconfig kserve-config-llm-worker-data-parallel