Implementation:Kserve Kserve LLM Worker DP Config

Knowledge Sources	Kserve_Kserve
Domains	Kubernetes, LLM Inference, Data Parallelism, RDMA
Last Updated	2026-02-13 00:00 GMT

Overview

This file defines the pod template configuration for standard (non-disaggregated) LLM workers with data-parallel execution and automatic RoCE/InfiniBand network discovery.

Description

Specified as an LLMInferenceServiceConfig (v1alpha2), this configuration provides a unified worker template that handles both the prefill and decode phases in the same pod, in contrast to the disaggregated prefill/decode configurations. The container named main, built from the llm-d-cuda image, runs the same RoCE auto-detection bash startup script as those disaggregated configurations. The script discovers active mlx5 HCA devices, selects GID indices suited to SR-IOV environments, and sets the NCCL, NVSHMEM, and UCX InfiniBand variables before launching multi-GPU data-parallel inference with vllm serve.
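
The HCA discovery step can be illustrated with a minimal sketch. The real startup script is elided in this page, so this is not its implementation: the helper name list_active_hcas is hypothetical, and the sketch assumes the standard Linux sysfs layout, where each port of an InfiniBand/RoCE device reports its state in /sys/class/infiniband/<device>/ports/<port>/state (for example "4: ACTIVE").

```shell
#!/usr/bin/env bash
# Sketch only: list_active_hcas is a hypothetical helper, not taken from
# the KServe source. It assumes the standard sysfs layout for RDMA devices.

# Print the name of every mlx5 device that has at least one ACTIVE port.
# $1: sysfs root to scan (defaults to the standard Linux location).
list_active_hcas() {
  local root="${1:-/sys/class/infiniband}"
  local dev port_state
  for dev in "$root"/mlx5_*; do
    # An unmatched glob stays literal, so skip anything that is not a directory.
    [ -d "$dev" ] || continue
    for port_state in "$dev"/ports/*/state; do
      [ -r "$port_state" ] || continue
      # The kernel reports an active port as "4: ACTIVE".
      if grep -q '^4: ACTIVE' "$port_state"; then
        basename "$dev"
        break   # one active port is enough; move on to the next device
      fi
    done
  done
}
```

Taking the sysfs root as a parameter keeps the function testable against a temporary directory tree; on a real node it would be called with no arguments.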

Usage

Use this configuration as the base template for LLM workers in a unified (non-disaggregated) serving architecture. This is the simpler deployment pattern where each worker handles the full inference pipeline without separating prefill and decode phases.

Code Reference

Source Location

Repository: Kserve_Kserve
File: config/llmisvcconfig/config-llm-worker-data-parallel.yaml

Signature

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-worker-data-parallel
spec:
  template:
    containers:
      - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
        imagePullPolicy: IfNotPresent
        name: main
        ports:
          - containerPort: 8000
            protocol: TCP
        command:
          - "/bin/bash"
          - "-c"
        args:
          - |-
            # Auto-detect RoCE HCAs, configure NCCL/NVSHMEM/UCX, launch vllm serve

Import

kubectl apply -f config/llmisvcconfig/config-llm-worker-data-parallel.yaml

I/O Contract

Main Container: main (llm-d-cuda image)

Component	Description
Image	`ghcr.io/llm-d/llm-d-cuda:v0.4.0`
Port	8000 (TCP)
Startup Script	Auto-detects RoCE HCAs, then sets NCCL_IB_HCA, NVSHMEM_HCA_LIST, UCX_NET_DEVICES, and the GID index variables
vLLM Launch	`vllm serve` with data-parallel, tensor-parallel, and expert-parallel flags
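
As an illustration of GID index selection, the sketch below picks the lowest-numbered GID table entry that is typed "RoCE v2" and holds a non-zero GID. This is an assumed selection rule, not the one used by the actual startup script, which is not shown on this page. The helper name find_rocev2_gid_index is hypothetical; the gid_attrs/types and gids directories are part of the standard Linux sysfs layout for RDMA ports.

```shell
#!/usr/bin/env bash
# Sketch only: find_rocev2_gid_index is a hypothetical helper, and the
# "lowest populated RoCE v2 entry" rule is an assumption for illustration.

# Print the first GID index on a port that is RoCE v2 and non-zero.
# $1: HCA name, $2: port number (default 1), $3: sysfs root (default standard).
find_rocev2_gid_index() {
  local hca="$1" port="${2:-1}" root="${3:-/sys/class/infiniband}"
  local base="$root/$hca/ports/$port"
  local types_dir="$base/gid_attrs/types"
  local zero="0000:0000:0000:0000:0000:0000:0000:0000"
  local idx gid
  # Entry names are plain integers, so a numeric sort gives table order.
  for idx in $(ls "$types_dir" 2>/dev/null | sort -n); do
    # Unpopulated entries can fail to read on real sysfs; ignore those errors.
    grep -q 'RoCE v2' "$types_dir/$idx" 2>/dev/null || continue
    gid="$(cat "$base/gids/$idx" 2>/dev/null)"
    # Skip empty or all-zero GID slots.
    [ -n "$gid" ] && [ "$gid" != "$zero" ] || continue
    echo "$idx"
    return 0
  done
  return 1   # no usable RoCE v2 entry found
}
```

The non-zero status on failure lets a caller fall back to a default index or abort startup, whichever suits the deployment.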

Comparison with Disaggregated Configs

Aspect	Standard Worker	Decode Worker	Prefill Worker
Config Name	`kserve-config-llm-worker-data-parallel`	`kserve-config-llm-decode-worker-data-parallel`	`kserve-config-llm-prefill-worker-data-parallel`
Config Section	`spec.template`	`spec.template`	`spec.prefill.template`
Routing Sidecar	Not included	Included (NIXL v2)	Not included
Role	Full inference pipeline	Token generation only	Prompt processing only

Environment Variables Configured by Startup Script

Variable	Description
`NCCL_IB_HCA`	Comma-separated list of active HCA device names
`NVSHMEM_HCA_LIST`	HCA list for NVSHMEM library
`UCX_NET_DEVICES`	UCX network devices (HCA:port format)
`NCCL_IB_GID_INDEX`	GID index NCCL uses for RoCE v2
`NVSHMEM_IB_GID_INDEX`	GID index NVSHMEM uses for RoCE v2
`UCX_IB_GID_INDEX`	GID index UCX uses for RoCE v2
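
To make the value formats in the table concrete, the following sketch derives all six variables from a GID index and a list of HCA names. The helper name export_ib_env is hypothetical, and two simplifications are assumptions rather than facts from the source: every HCA is assumed to use port 1, and one GID index is applied to all three libraries.

```shell
#!/usr/bin/env bash
# Sketch only: export_ib_env is a hypothetical helper. It assumes port 1 on
# every HCA and a single GID index shared by NCCL, NVSHMEM, and UCX.

# $1: GID index; remaining arguments: HCA device names (e.g. mlx5_0 mlx5_1).
export_ib_env() {
  local gid_index="$1"
  shift
  local IFS=,                 # "$*" and "${array[*]}" now join with commas
  local hca_list="$*"
  local ucx_devices=()
  local hca
  for hca in "$@"; do
    ucx_devices+=("${hca}:1") # UCX expects HCA:port entries
  done

  export NCCL_IB_HCA="$hca_list"
  export NVSHMEM_HCA_LIST="$hca_list"
  export UCX_NET_DEVICES="${ucx_devices[*]}"
  export NCCL_IB_GID_INDEX="$gid_index"
  export NVSHMEM_IB_GID_INDEX="$gid_index"
  export UCX_IB_GID_INDEX="$gid_index"
}
```

For example, `export_ib_env 3 mlx5_0 mlx5_1` sets NCCL_IB_HCA to `mlx5_0,mlx5_1` and UCX_NET_DEVICES to `mlx5_0:1,mlx5_1:1`.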

vLLM Serve Command Template

Flag	Source	Description
`--served-model-name`	`.Spec.Model.Name`	Model name from the LLMInferenceService spec
`--port`	hardcoded	Serving port, fixed at 8000 to match the container port
`--data-parallel-size`	`.Spec.Parallelism.Data`	Number of data-parallel ranks (default: 1)
`--data-parallel-size-local`	`.Spec.Parallelism.DataLocal`	Local data-parallel ranks (default: 1)
`--tensor-parallel-size`	`.Spec.Parallelism.Tensor`	Tensor parallelism degree
`--enable-expert-parallel`	`.Spec.Parallelism.Expert`	Enable MoE expert parallelism
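
The flag mapping above can be sketched as a small bash function that assembles the flag list. In the real config these values are filled in by templating over the LLMInferenceService spec; the helper name build_vllm_flags and the VLLM_FLAGS array are hypothetical. The default of 1 for tensor parallelism is an assumption, since the table states defaults only for the two data-parallel flags. Any positional model argument to vllm serve is left out because this page does not show it.

```shell
#!/usr/bin/env bash
# Sketch only: build_vllm_flags and VLLM_FLAGS are hypothetical names.
# Defaults of 1 mirror the table for the data-parallel flags; the tensor
# parallel default of 1 is an assumption.

# $1: served model name      (.Spec.Model.Name)
# $2: data-parallel size     (.Spec.Parallelism.Data, default 1)
# $3: local data-parallel    (.Spec.Parallelism.DataLocal, default 1)
# $4: tensor-parallel size   (.Spec.Parallelism.Tensor, assumed default 1)
# $5: "true" to enable expert parallelism (.Spec.Parallelism.Expert)
build_vllm_flags() {
  local model="$1" dp="${2:-1}" dp_local="${3:-1}" tp="${4:-1}" expert="${5:-false}"
  VLLM_FLAGS=(
    --served-model-name "$model"
    --port 8000
    --data-parallel-size "$dp"
    --data-parallel-size-local "$dp_local"
    --tensor-parallel-size "$tp"
  )
  # --enable-expert-parallel is a boolean switch, so it is added only when set.
  if [ "$expert" = "true" ]; then
    VLLM_FLAGS+=(--enable-expert-parallel)
  fi
}
```

Building the flags as an array rather than a string keeps values with spaces intact when the array is later expanded as `"${VLLM_FLAGS[@]}"`.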

Usage Examples

# Apply the standard worker configuration
kubectl apply -f config/llmisvcconfig/config-llm-worker-data-parallel.yaml

# Verify the config is created
kubectl get llminferenceserviceconfig kserve-config-llm-worker-data-parallel

Related Pages

Kserve_Kserve_LLM_Decode_Worker_DP_Config - Decode worker configuration for disaggregated inference
Kserve_Kserve_LLM_Prefill_Worker_DP_Config - Prefill worker configuration for disaggregated inference
Kserve_Kserve_LLMInferenceServiceConfig_Minimal_CRD - CRD definition for the LLMInferenceServiceConfig resource
