Implementation:Kserve Kserve LLM Prefill Worker DP Config

Knowledge Sources	Kserve_Kserve
Domains	Kubernetes, LLM Inference, Data Parallelism, RDMA
Last Updated	2026-02-13 00:00 GMT

Overview

This file defines the pod template configuration for LLM prefill workers with data-parallel execution and automatic RoCE/InfiniBand network discovery.

Description

Specified as an LLMInferenceServiceConfig (v1alpha2), this configuration places the container template under `spec.prefill`, which distinguishes it from the decode and standard worker configs. The container named `main` runs the llm-d-cuda image and includes the same RoCE auto-detection bash startup script used in the decode worker. The script discovers active mlx5 HCA devices, determines the appropriate GID indices for SR-IOV environments, and configures the NCCL, NVSHMEM, and UCX InfiniBand settings. Unlike the decode worker config, this configuration does not include a routing sidecar init container, because prefill workers receive work from the routing layer only indirectly.
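The HCA discovery step described above can be sketched in bash as follows. This is a minimal illustration under stated assumptions, not the script shipped in the config: it assumes the standard Linux sysfs layout under /sys/class/infiniband, and the `IB_SYSFS_ROOT` override and the `list_active_hcas` function name are hypothetical, introduced only so the logic can be run against a fake directory tree.

```shell
#!/usr/bin/env bash
# Sketch: print a comma-separated list of mlx5 HCAs that have at least one
# port in the ACTIVE state. IB_SYSFS_ROOT is a hypothetical override for tests.

list_active_hcas() {
  local root="${IB_SYSFS_ROOT:-/sys/class/infiniband}"
  local dev_path state_file hca
  local -a active=()

  for dev_path in "$root"/mlx5_*; do
    # An unmatched glob stays literal, so skip anything that is not a directory.
    [ -d "$dev_path" ] || continue
    hca="${dev_path##*/}"

    for state_file in "$dev_path"/ports/*/state; do
      [ -r "$state_file" ] || continue
      # The kernel reports the port state as text such as "4: ACTIVE".
      if grep -q "ACTIVE" "$state_file"; then
        active+=("$hca")
        break
      fi
    done
  done

  # Join the collected device names with commas.
  local IFS=,
  echo "${active[*]}"
}
```

On a node with RDMA devices, the result could then be assigned with something like `export NCCL_IB_HCA="$(list_active_hcas)"`. The function prints an empty line when no active device is found, so a caller may want to check for that case before exporting.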

Usage

Use this configuration as the base template for prefill workers in a disaggregated prefill-decode (PD) LLM serving architecture. It is referenced by LLMInferenceService resources that require dedicated prefill pools for prompt processing before handing off KV cache state to decode workers.

Code Reference

Source Location

Repository: Kserve_Kserve
File: config/llmisvcconfig/config-llm-prefill-worker-data-parallel.yaml

Signature

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-prefill-worker-data-parallel
spec:
  prefill:
    template:
      containers:
        - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
          imagePullPolicy: IfNotPresent
          name: main
          ports:
            - containerPort: 8000
              protocol: TCP
          command:
            - "/bin/bash"
            - "-c"
          args:
            - |-
              # Auto-detect RoCE HCAs, configure NCCL/NVSHMEM/UCX, launch vllm serve

Import

kubectl apply -f config/llmisvcconfig/config-llm-prefill-worker-data-parallel.yaml

I/O Contract

Main Container: main (llm-d-cuda image)

Component	Description
Image	`ghcr.io/llm-d/llm-d-cuda:v0.4.0`
Port	8000 (TCP)
Startup Script	Auto-detects RoCE HCAs, configures NCCL_IB_HCA, NVSHMEM_HCA_LIST, UCX_NET_DEVICES, GID indices
vLLM Launch	`vllm serve` with data-parallel, tensor-parallel, and expert-parallel flags
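To make the "vLLM Launch" row more concrete, the sketch below assembles a `vllm serve` command line with the three kinds of parallelism flags. It is only an illustration: the `build_vllm_cmd` helper, the model name, and the parallel sizes are hypothetical, and the actual startup script may pass different or additional flags. The flags themselves (`--data-parallel-size`, `--tensor-parallel-size`, `--enable-expert-parallel`, `--port`) are standard vLLM options.

```shell
#!/usr/bin/env bash
# Sketch: build (but do not run) a vllm serve command line.
# Arguments: model, data-parallel size, tensor-parallel size, optional port.

build_vllm_cmd() {
  local model="$1" dp="$2" tp="$3" port="${4:-8000}"
  local -a cmd=(
    vllm serve "$model"
    --port "$port"
    --data-parallel-size "$dp"
    --tensor-parallel-size "$tp"
    --enable-expert-parallel
  )
  # Print the command as a single space-separated line.
  printf '%s\n' "${cmd[*]}"
}
```

Printing the command rather than executing it keeps the sketch safe to run anywhere. The default port of 8000 matches the container port listed above.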

Key Differences from Decode Worker

Aspect	Prefill Worker	Decode Worker
Config Section	`spec.prefill.template`	`spec.template`
Routing Sidecar	Not included	`llm-d-routing-sidecar` with NIXL v2
Serving Port	8000	8001 (behind sidecar on 8000)
Role	Processes initial prompts (prefill phase)	Generates tokens (decode phase)

Environment Variables Configured by Startup Script

Variable	Description
`NCCL_IB_HCA`	Comma-separated list of active HCA device names
`NVSHMEM_HCA_LIST`	HCA list for NVSHMEM library
`UCX_NET_DEVICES`	UCX network devices (HCA:port format)
`NCCL_IB_GID_INDEX`	GID index NCCL uses for RoCE v2
`NVSHMEM_IB_GID_INDEX`	GID index for NVSHMEM
`UCX_IB_GID_INDEX`	GID index for UCX
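The variables in the table above could be derived roughly as in the following bash sketch. It is a hedged illustration, not the startup script itself. It assumes the standard sysfs files `ports/<port>/gids/<idx>` and `ports/<port>/gid_attrs/types/<idx>`, takes the first populated RoCE v2 entry as the GID index (the real script may apply a different selection rule for SR-IOV), and sets `NVSHMEM_HCA_LIST` to the same comma-separated list as `NCCL_IB_HCA`. The function names and the `IB_SYSFS_ROOT` override are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: pick a RoCE v2 GID index and export the InfiniBand-related variables.

# Print the first GID index on <hca>/<port> whose type is "RoCE v2" and whose
# GID is populated (not all zeros). Returns non-zero if none is found.
find_rocev2_gid_index() {
  local hca="$1" port="${2:-1}"
  local root="${IB_SYSFS_ROOT:-/sys/class/infiniband}"
  local port_dir="$root/$hca/ports/$port"
  local zero_gid="0000:0000:0000:0000:0000:0000:0000:0000"
  local idx=0 gid type

  while [ -e "$port_dir/gids/$idx" ]; do
    # Reading an unpopulated entry can fail on some kernels, so ignore errors.
    gid="$(cat "$port_dir/gids/$idx" 2>/dev/null)"
    type="$(cat "$port_dir/gid_attrs/types/$idx" 2>/dev/null)"
    if [ "$type" = "RoCE v2" ] && [ -n "$gid" ] && [ "$gid" != "$zero_gid" ]; then
      echo "$idx"
      return 0
    fi
    idx=$((idx + 1))
  done
  return 1
}

# Export the six variables from a comma-separated HCA list and a GID index.
export_ib_env() {
  local hca_csv="$1" gid_index="$2" port="${3:-1}"
  local -a hcas=() ucx=()
  local hca

  IFS=, read -r -a hcas <<< "$hca_csv"
  for hca in "${hcas[@]}"; do
    ucx+=("$hca:$port")   # UCX expects HCA:port entries
  done

  local IFS=,
  export NCCL_IB_HCA="$hca_csv"
  export NVSHMEM_HCA_LIST="$hca_csv"      # assumption: same list as NCCL
  export UCX_NET_DEVICES="${ucx[*]}"
  export NCCL_IB_GID_INDEX="$gid_index"
  export NVSHMEM_IB_GID_INDEX="$gid_index"
  export UCX_IB_GID_INDEX="$gid_index"
}
```

A caller might combine the two as `export_ib_env "mlx5_0,mlx5_1" "$(find_rocev2_gid_index mlx5_0 1)"`. Using one GID index for all three libraries mirrors the table, where each library has its own variable but the same meaning.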

Usage Examples

# Apply the prefill worker configuration
kubectl apply -f config/llmisvcconfig/config-llm-prefill-worker-data-parallel.yaml

# Verify the config is created
kubectl get llminferenceserviceconfig kserve-config-llm-prefill-worker-data-parallel

Related Pages

Kserve_Kserve_LLM_Decode_Worker_DP_Config - Companion decode worker configuration for disaggregated inference
Kserve_Kserve_LLM_Worker_DP_Config - Standard (non-disaggregated) worker configuration
Kserve_Kserve_LLMInferenceServiceConfig_Minimal_CRD - CRD definition for the LLMInferenceServiceConfig resource
Kserve_Kserve_DeepSeek_R1_PD_DeepEP_HT_Sample - Sample that uses prefill workers in a PD architecture
