Implementation:Kserve Kserve LLM Prefill Worker DP Config

From Leeroopedia
Knowledge Sources
Domains: Kubernetes, LLM Inference, Data Parallelism, RDMA
Last Updated: 2026-02-13 00:00 GMT

Overview

This file defines the pod template configuration for LLM prefill workers with data-parallel execution and automatic RoCE/InfiniBand network discovery.

Description

Specified as an LLMInferenceServiceConfig (v1alpha2), this configuration places the container template under a prefill section, which distinguishes it from decode and standard worker configs. The main container, built from the llm-d-cuda image, runs the same RoCE auto-detection bash startup script as the decode worker. That script discovers active mlx5 HCA devices, determines the GID indices to use in SR-IOV environments, and configures the NCCL, NVSHMEM, and UCX InfiniBand settings. Unlike the decode worker config, this one does not include a routing sidecar init container, because prefill workers receive work from the routing layer only indirectly.
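The HCA discovery step described above can be sketched as follows. This is a minimal illustration, not the script shipped in the config: the function name list_active_hcas and its optional root-directory argument are hypothetical, and the sketch only assumes the standard Linux sysfs layout in which each port exposes a state file such as "4: ACTIVE".

```shell
# Sketch only: NOT the startup script from the actual config.
# list_active_hcas [SYSFS_ROOT]
# Prints a comma-separated list of mlx5 devices that have at least
# one port whose sysfs state file reports ACTIVE.
list_active_hcas() {
  local root="${1:-/sys/class/infiniband}"
  local dev_dir state_file name
  local result=""
  for dev_dir in "$root"/mlx5_*; do
    # Skip the literal pattern when the glob matches nothing.
    [ -d "$dev_dir" ] || continue
    name="$(basename "$dev_dir")"
    for state_file in "$dev_dir"/ports/*/state; do
      [ -r "$state_file" ] || continue
      # The kernel reports port state as e.g. "4: ACTIVE" or "1: DOWN".
      if grep -q "ACTIVE" "$state_file"; then
        result="${result:+$result,}$name"
        break   # one active port is enough to keep the device
      fi
    done
  done
  echo "$result"
}
```

Taking the sysfs root as a parameter keeps the function usable against a fake directory tree, which makes it possible to exercise the logic on a machine without RDMA hardware. On a real node, calling list_active_hcas with no arguments scans /sys/class/infiniband.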

Usage

Use this configuration as the base template for prefill workers in a disaggregated prefill-decode (PD) LLM serving architecture. It is referenced by LLMInferenceService resources that require dedicated prefill pools for prompt processing before handing off KV cache state to decode workers.

Code Reference

Source Location

Signature

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-prefill-worker-data-parallel
spec:
  prefill:
    template:
      containers:
        - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
          imagePullPolicy: IfNotPresent
          name: main
          ports:
            - containerPort: 8000
              protocol: TCP
          command:
            - "/bin/bash"
            - "-c"
          args:
            - |-
              # Auto-detect RoCE HCAs, configure NCCL/NVSHMEM/UCX, launch vllm serve

Import

kubectl apply -f config/llmisvcconfig/config-llm-prefill-worker-data-parallel.yaml

I/O Contract

Main Container: llm-d-cuda

| Component | Description |
|---|---|
| Image | ghcr.io/llm-d/llm-d-cuda:v0.4.0 |
| Port | 8000 (TCP) |
| Startup Script | Auto-detects RoCE HCAs, configures NCCL_IB_HCA, NVSHMEM_HCA_LIST, UCX_NET_DEVICES, GID indices |
| vLLM Launch | vllm serve with data-parallel, tensor-parallel, and expert-parallel flags |
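To make the vLLM launch row concrete, here is a hedged sketch of how such a command line might be assembled. The helper name vllm_serve_cmd, the model name, and the parallelism sizes are hypothetical; the actual arguments used by the config's startup script are not shown on this page. The flags themselves (--data-parallel-size, --tensor-parallel-size, --enable-expert-parallel, --port) are standard vllm serve options.

```shell
# Sketch only: the real config may pass different or additional flags.
# vllm_serve_cmd MODEL DP_SIZE TP_SIZE
# Prints a vllm serve command line listening on the prefill port 8000.
vllm_serve_cmd() {
  local model="$1" dp="$2" tp="$3"
  echo "vllm serve $model --port 8000" \
       "--data-parallel-size $dp" \
       "--tensor-parallel-size $tp" \
       "--enable-expert-parallel"
}
```

For example, vllm_serve_cmd my-org/my-model 4 2 prints a command that requests four data-parallel ranks, each using two-way tensor parallelism, with expert parallelism enabled.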

Key Differences from Decode Worker

| Aspect | Prefill Worker | Decode Worker |
|---|---|---|
| Config Section | spec.prefill.template | spec.template |
| Routing Sidecar | Not included | llm-d-routing-sidecar with NIXL v2 |
| Serving Port | 8000 | 8001 (behind sidecar on 8000) |
| Role | Processes initial prompts (prefill phase) | Generates tokens (decode phase) |

Environment Variables Configured by Startup Script

| Variable | Description |
|---|---|
| NCCL_IB_HCA | Comma-separated list of active HCA device names |
| NVSHMEM_HCA_LIST | HCA list for NVSHMEM library |
| UCX_NET_DEVICES | UCX network devices (HCA:port format) |
| NCCL_IB_GID_INDEX | InfiniBand GID index for RoCE v2 |
| NVSHMEM_IB_GID_INDEX | GID index for NVSHMEM |
| UCX_IB_GID_INDEX | GID index for UCX |
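The sketch below illustrates how the variables in this table could be derived from a device list and a GID index. It is a simplified stand-in for the real startup script, under several assumptions: the function names are hypothetical, port 1 is assumed for every HCA, NVSHMEM_HCA_LIST is given the same plain name list as NCCL_IB_HCA, and the GID index is simply the first entry whose sysfs type reads "RoCE v2" (the shipped script's SR-IOV handling may be more involved).

```shell
# Sketch only: simplified stand-in for the config's startup script.
# first_rocev2_gid_index DEVICE [PORT] [SYSFS_ROOT]
# Prints the first GID index whose type is "RoCE v2"; returns 1 if none.
first_rocev2_gid_index() {
  local dev="$1" port="${2:-1}" root="${3:-/sys/class/infiniband}"
  local types_dir="$root/$dev/ports/$port/gid_attrs/types"
  local idx=0
  while [ -e "$types_dir/$idx" ]; do
    # Unpopulated GID entries fail to read; treat them as non-matching.
    if [ "$(cat "$types_dir/$idx" 2>/dev/null)" = "RoCE v2" ]; then
      echo "$idx"
      return 0
    fi
    idx=$((idx + 1))
  done
  return 1
}

# rdma_env_exports HCA_CSV GID_INDEX [PORT]
# Prints export statements for the six variables listed in the table.
rdma_env_exports() {
  local hcas="$1" gid="$2" port="${3:-1}"
  local ucx="" dev
  local IFS=,
  for dev in $hcas; do
    ucx="${ucx:+$ucx,}$dev:$port"   # UCX expects HCA:port entries
  done
  printf 'export NCCL_IB_HCA=%s\n' "$hcas"
  printf 'export NVSHMEM_HCA_LIST=%s\n' "$hcas"   # assumption: same name list
  printf 'export UCX_NET_DEVICES=%s\n' "$ucx"
  printf 'export NCCL_IB_GID_INDEX=%s\n' "$gid"
  printf 'export NVSHMEM_IB_GID_INDEX=%s\n' "$gid"
  printf 'export UCX_IB_GID_INDEX=%s\n' "$gid"
}
```

Printing export statements instead of exporting directly keeps the helper side-effect free; a caller could apply the result with eval "$(rdma_env_exports "mlx5_0,mlx5_2" 3)" before launching the server.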

Usage Examples

# Apply the prefill worker configuration
kubectl apply -f config/llmisvcconfig/config-llm-prefill-worker-data-parallel.yaml

# Verify the config is created
kubectl get llminferenceserviceconfig kserve-config-llm-prefill-worker-data-parallel
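Beyond checking that the resource exists, it can be useful to confirm individual fields of the prefill container. The helper below is a small sketch built on kubectl's jsonpath output; the function name prefill_container_field is hypothetical, and the field path follows the spec.prefill.template.containers structure shown in the Signature section. It assumes the resource lives in the current namespace, as the commands above do.

```shell
# Sketch only: hypothetical helper around standard kubectl jsonpath output.
# prefill_container_field FIELD
# Prints one field (e.g. image, name) of the first prefill container.
prefill_container_field() {
  local field="$1"
  kubectl get llminferenceserviceconfig \
    kserve-config-llm-prefill-worker-data-parallel \
    -o "jsonpath={.spec.prefill.template.containers[0].${field}}"
}
```

Running prefill_container_field image against a cluster where the config has been applied should print the container image recorded in the resource, and prefill_container_field name should print the container name.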
