Overview
This file defines the pod template configuration for LLM prefill workers with data-parallel execution and automatic RoCE/InfiniBand network discovery.
Description
Specified as an LLMInferenceServiceConfig (v1alpha2), this configuration places the container template under the prefill section (spec.prefill.template), which distinguishes it from decode and standard worker configs. The main container, which runs the llm-d-cuda image, uses the same RoCE auto-detection bash startup script as the decode worker: it discovers active mlx5 HCA devices, determines the GID indices to use in SR-IOV environments, and configures the InfiniBand settings for NCCL, NVSHMEM, and UCX. Unlike the decode worker config, it does not include a routing sidecar init container, because prefill workers receive work from the routing layer indirectly.
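The HCA discovery step described above can be sketched in shell as follows. This is a minimal illustration under stated assumptions, not the script shipped in the image: the helper name `discover_active_hcas` and the overridable sysfs root are hypothetical, introduced only for this example. The directory layout it reads (`/sys/class/infiniband/<device>/ports/<port>/state`) is the standard Linux RDMA sysfs layout.

```shell
#!/usr/bin/env bash
# Sketch only: list mlx5 HCAs that have at least one port in the ACTIVE state.
# discover_active_hcas is a hypothetical helper, not part of the llm-d image.

discover_active_hcas() {
  # $1: sysfs root to scan; overridable so the function can be tried without hardware.
  local root="${1:-/sys/class/infiniband}"
  local dev state_file
  for dev in "$root"/mlx5_*; do
    [ -d "$dev" ] || continue            # the glob matched nothing
    for state_file in "$dev"/ports/*/state; do
      [ -r "$state_file" ] || continue
      # The kernel reports port state as e.g. "4: ACTIVE" or "1: DOWN".
      if grep -q ': ACTIVE$' "$state_file"; then
        basename "$dev"
        break                            # one active port is enough
      fi
    done
  done
}

# Join the device names with commas, the list form described for NCCL_IB_HCA.
hcas="$(discover_active_hcas | paste -sd, -)"
if [ -n "$hcas" ]; then
  export NCCL_IB_HCA="$hcas"
fi
```

On a node without RDMA devices the function prints nothing and NCCL_IB_HCA is left unset, so the sketch is safe to run anywhere.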
Usage
Use this configuration as the base template for prefill workers in a disaggregated prefill-decode (PD) LLM serving architecture. It is referenced by LLMInferenceService resources that require dedicated prefill pools for prompt processing before handing off KV cache state to decode workers.
Code Reference
Source Location
Signature
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-prefill-worker-data-parallel
spec:
  prefill:
    template:
      containers:
        - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
          imagePullPolicy: IfNotPresent
          name: main
          ports:
            - containerPort: 8000
              protocol: TCP
          command:
            - "/bin/bash"
            - "-c"
          args:
            - |-
              # Auto-detect RoCE HCAs, configure NCCL/NVSHMEM/UCX, launch vllm serve
Import
kubectl apply -f config/llmisvcconfig/config-llm-prefill-worker-data-parallel.yaml
I/O Contract
Main Container: llm-d-cuda
| Component | Description |
| --- | --- |
| Image | ghcr.io/llm-d/llm-d-cuda:v0.4.0 |
| Port | 8000 (TCP) |
| Startup Script | Auto-detects RoCE HCAs, configures NCCL_IB_HCA, NVSHMEM_HCA_LIST, UCX_NET_DEVICES, GID indices |
| vLLM Launch | vllm serve with data-parallel, tensor-parallel, and expert-parallel flags |
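As a rough illustration of the vLLM launch row, the sketch below assembles a `vllm serve` command line from parallelism settings. The helper `build_vllm_cmd`, the variable names `MODEL_NAME`, `TP_SIZE`, `DP_SIZE`, and `ENABLE_EP`, and the placeholder model name are all assumptions made for this example; the actual startup script may pass different or additional flags. Only `--port`, `--tensor-parallel-size`, `--data-parallel-size`, and `--enable-expert-parallel` are used here.

```shell
#!/usr/bin/env bash
# Sketch only: build a vllm serve command line from parallelism settings.
# build_vllm_cmd and the MODEL_NAME/TP_SIZE/DP_SIZE/ENABLE_EP names are hypothetical.

build_vllm_cmd() {
  # $1: model name, $2: tensor-parallel size, $3: data-parallel size, $4: "true" to enable expert parallelism
  local model="$1" tp="${2:-1}" dp="${3:-1}" ep="${4:-false}"
  local cmd="vllm serve $model --port 8000"
  cmd="$cmd --tensor-parallel-size $tp"
  cmd="$cmd --data-parallel-size $dp"
  if [ "$ep" = "true" ]; then
    cmd="$cmd --enable-expert-parallel"
  fi
  printf '%s\n' "$cmd"
}

# Print the command instead of running it, so the sketch is safe to try anywhere.
build_vllm_cmd "${MODEL_NAME:-my-org/my-model}" "${TP_SIZE:-1}" "${DP_SIZE:-2}" "${ENABLE_EP:-true}"
```

Printing rather than executing keeps the example independent of a vLLM installation; a real entrypoint would typically end with `exec` so that vLLM receives signals directly.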
Key Differences from Decode Worker
| Aspect | Prefill Worker | Decode Worker |
| --- | --- | --- |
| Config Section | spec.prefill.template | spec.template |
| Routing Sidecar | Not included | llm-d-routing-sidecar with NIXL v2 |
| Serving Port | 8000 | 8001 (behind sidecar on 8000) |
| Role | Processes initial prompts (prefill phase) | Generates tokens (decode phase) |
Environment Variables Configured by Startup Script
| Variable | Description |
| --- | --- |
| NCCL_IB_HCA | Comma-separated list of active HCA device names |
| NVSHMEM_HCA_LIST | HCA list for NVSHMEM library |
| UCX_NET_DEVICES | UCX network devices (HCA:port format) |
| NCCL_IB_GID_INDEX | InfiniBand GID index for RoCE v2 |
| NVSHMEM_IB_GID_INDEX | GID index for NVSHMEM |
| UCX_IB_GID_INDEX | GID index for UCX |
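To make the variable formats in the table concrete, the following sketch derives a UCX device list and a RoCE v2 GID index. Both helpers (`to_ucx_devices`, `find_rocev2_gid_index`) are hypothetical, and the selection heuristic is an assumption: it picks the first GID table entry whose type is RoCE v2 and whose GID is an IPv4-mapped address. The shipped startup script may choose differently, particularly on SR-IOV virtual functions. Exporting the same index to all three GID variables is also an assumption of this example.

```shell
#!/usr/bin/env bash
# Sketch only: derive UCX device names and a RoCE v2 GID index.
# Both helpers are hypothetical; the heuristic may differ from the shipped script.

# Turn "mlx5_0,mlx5_2" into "mlx5_0:1,mlx5_2:1" (the HCA:port form).
to_ucx_devices() {
  local hcas="$1" port="${2:-1}"
  [ -n "$hcas" ] || return 0
  printf '%s\n' "$hcas" | sed "s/,/:${port},/g; s/\$/:${port}/"
}

# Print the first GID index on <dev>/<port> that is RoCE v2 with an IPv4-mapped GID.
find_rocev2_gid_index() {
  local dev="$1" port="${2:-1}" root="${3:-/sys/class/infiniband}"
  local base="$root/$dev/ports/$port"
  local idx=0 gid_type gid
  while [ -e "$base/gid_attrs/types/$idx" ]; do
    # Unpopulated table slots can fail to read, so read errors are ignored.
    gid_type="$(cat "$base/gid_attrs/types/$idx" 2>/dev/null || true)"
    gid="$(cat "$base/gids/$idx" 2>/dev/null || true)"
    case "$gid_type:$gid" in
      "RoCE v2:0000:0000:0000:0000:0000:ffff:"*)
        echo "$idx"
        return 0
        ;;
    esac
    idx=$((idx + 1))
  done
  return 1
}

hcas="${NCCL_IB_HCA:-mlx5_0}"            # mlx5_0 is only a placeholder default
export UCX_NET_DEVICES="$(to_ucx_devices "$hcas")"
if idx="$(find_rocev2_gid_index "${hcas%%,*}" 1)"; then
  export NCCL_IB_GID_INDEX="$idx" NVSHMEM_IB_GID_INDEX="$idx" UCX_IB_GID_INDEX="$idx"
fi
```

When no matching entry is found, the GID index variables are left unset so that each library falls back to its own default.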
Usage Examples
# Apply the prefill worker configuration
kubectl apply -f config/llmisvcconfig/config-llm-prefill-worker-data-parallel.yaml
# Verify the config is created
kubectl get llminferenceserviceconfig kserve-config-llm-prefill-worker-data-parallel