Implementation:Kserve Kserve LLM Decode Worker DP Config

Knowledge Sources	Kserve_Kserve
Domains	Kubernetes, LLM Inference, Data Parallelism, RDMA
Last Updated	2026-02-13 00:00 GMT

Overview

This file defines the pod template configuration for LLM decode workers that run with data-parallel execution. It enables the NIXL v2 connector for KV cache transfer and configures RDMA (RoCE) networking for GPU-to-GPU communication.

Description

Specified as an LLMInferenceServiceConfig (serving.kserve.io/v1alpha2), this configuration defines two containers. The llm-d-routing-sidecar init container listens on port 8000 and forwards to vLLM on port 8001, using the NIXL v2 connector for KV cache transfer; because it sets restartPolicy: Always, Kubernetes runs it as a sidecar for the lifetime of the pod rather than as a run-to-completion init step. The main llm-d-cuda container runs vLLM from an embedded bash startup script. That script auto-detects RoCE (RDMA over Converged Ethernet) HCA devices, discovers active mlx5 interfaces, determines the GID index to use in SR-IOV environments, and exports the resulting NCCL, NVSHMEM, and UCX settings before launching multi-GPU data-parallel inference via vllm serve.
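
The HCA discovery step can be illustrated with a minimal sketch. This is not the script embedded in the configuration; it only shows one plausible way to find active RoCE-capable mlx5 devices through the standard Linux sysfs layout (`/sys/class/infiniband/<dev>/ports/<n>/state` and `link_layer`). The function name `list_active_roce_hcas` and the rule "a port is usable when its state is ACTIVE and its link layer is Ethernet" are assumptions made for illustration.

```shell
# Hypothetical helper (not taken from the KServe config): print a
# comma-separated list of mlx5 devices that have at least one ACTIVE port
# whose link layer is Ethernet (i.e. RoCE rather than native InfiniBand).
# The sysfs root is a parameter so the function can be exercised against a
# fake directory tree.
list_active_roce_hcas() {
  local root="${1:-/sys/class/infiniband}"
  local dev port
  local hcas=()
  for dev in "$root"/mlx5_*; do
    [ -d "$dev" ] || continue
    for port in "$dev"/ports/*; do
      [ -r "$port/state" ] || continue
      # state reads like "4: ACTIVE"; link_layer reads "Ethernet" for RoCE.
      if grep -q ACTIVE "$port/state" \
         && grep -q Ethernet "$port/link_layer" 2>/dev/null; then
        hcas+=("$(basename "$dev")")
        break   # one active port is enough to keep the device
      fi
    done
  done
  local IFS=,
  echo "${hcas[*]}"
}
```

Under these assumptions, a startup script could export the result with `export NCCL_IB_HCA="$(list_active_roce_hcas)"` and reuse the same list for `NVSHMEM_HCA_LIST`.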

Usage

Use this configuration as the base template for decode workers in a disaggregated prefill-decode LLM serving architecture. It is referenced by LLMInferenceService resources that require dedicated decode pools with data-parallel execution and RDMA networking.

Code Reference

Source Location

Repository: Kserve_Kserve
File: config/llmisvcconfig/config-llm-decode-worker-data-parallel.yaml

Signature

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-decode-worker-data-parallel
spec:
  template:
    initContainers:
      - name: llm-d-routing-sidecar
        image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
        restartPolicy: Always
        ports:
          - containerPort: 8000
        args:
          - "--port=8000"
          - "--vllm-port=8001"
          - "--connector=nixlv2"
          - "--secure-proxy=false"
    containers:
      - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
        name: main
        ports:
          - containerPort: 8001
        command:
          - "/bin/bash"
          - "-c"
        args:
          - |-
            # Auto-detect RoCE HCAs, configure NCCL/NVSHMEM/UCX, launch vllm serve

Import

kubectl apply -f config/llmisvcconfig/config-llm-decode-worker-data-parallel.yaml

I/O Contract

Init Container: llm-d-routing-sidecar

Parameter	Value	Description
`--port`	8000	External-facing proxy port
`--vllm-port`	8001	Internal vLLM engine port
`--connector`	nixlv2	KV cache transfer connector type
`--secure-proxy`	false	TLS disabled (BackendTLSPolicy not yet implemented)

Main Container: llm-d-cuda

Component	Description
Image	`ghcr.io/llm-d/llm-d-cuda:v0.4.0`
Port	8001 (TCP)
Startup Script	Auto-detects RoCE HCAs, configures NCCL_IB_HCA, NVSHMEM_HCA_LIST, UCX_NET_DEVICES, GID indices
vLLM Launch	`vllm serve` with data-parallel, tensor-parallel, and expert-parallel flags
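
As a rough illustration of the launch step, the sketch below assembles a `vllm serve` command line with data-parallel, tensor-parallel, and expert-parallel flags. The flags shown (`--data-parallel-size`, `--tensor-parallel-size`, `--enable-expert-parallel`, `--port`) are standard vLLM options, but the exact set and values used by the embedded script are not shown on this page, so the helper name `build_vllm_cmd`, its defaults, and the model path are assumptions.

```shell
# Hypothetical helper (not taken from the KServe config): print a vllm serve
# command line for a data-parallel decode worker listening on port 8001.
#   $1 = model path or name (illustrative)
#   $2 = data-parallel size (default 1)
#   $3 = tensor-parallel size (default 1)
build_vllm_cmd() {
  local model="$1" dp="${2:-1}" tp="${3:-1}"
  local cmd=(vllm serve "$model"
             --port 8001
             --data-parallel-size "$dp"
             --tensor-parallel-size "$tp"
             --enable-expert-parallel)
  # Join with single spaces for display; a real script would exec "${cmd[@]}".
  echo "${cmd[*]}"
}
```

Building the command as an array keeps each argument a separate word, which avoids quoting problems if the model path contains spaces.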

Environment Variables Configured by Startup Script

Variable	Description
`NCCL_IB_HCA`	Comma-separated list of active HCA device names
`NVSHMEM_HCA_LIST`	HCA list for NVSHMEM library
`UCX_NET_DEVICES`	UCX network devices (HCA:port format)
`NCCL_IB_GID_INDEX`	InfiniBand GID index for RoCE v2
`NVSHMEM_IB_GID_INDEX`	GID index for NVSHMEM
`UCX_IB_GID_INDEX`	GID index for UCX
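
The variables above could be derived along the following lines. This is a sketch under stated assumptions, not the script shipped in the configuration: it assumes port 1 on every HCA for the `HCA:port` format, and it assumes the preferred GID is the lowest-numbered entry whose type is "RoCE v2" and whose address is IPv4-mapped (prefix `0000:0000:0000:0000:0000:ffff:`). Both helper names are hypothetical; the sysfs paths (`gid_attrs/types/<n>` and `gids/<n>` under a port directory) are the standard Linux layout.

```shell
# Hypothetical helper: convert "mlx5_0,mlx5_1" into "mlx5_0:1,mlx5_1:1".
#   $1 = comma-separated HCA list, $2 = port number (default 1, an assumption)
to_ucx_net_devices() {
  local port="${2:-1}" out="" hca
  local IFS=,
  for hca in $1; do
    out+="${out:+,}${hca}:${port}"
  done
  echo "$out"
}

# Hypothetical helper: print the lowest GID index on a port that is of type
# "RoCE v2" and carries an IPv4-mapped address. Returns 1 if none is found.
#   $1 = port directory, e.g. /sys/class/infiniband/mlx5_0/ports/1
find_rocev2_gid_index() {
  local port_dir="$1" idx gid
  for idx in $(ls "$port_dir/gid_attrs/types" 2>/dev/null | sort -n); do
    # Unpopulated entries fail to read on real sysfs; skip them quietly.
    grep -q "RoCE v2" "$port_dir/gid_attrs/types/$idx" 2>/dev/null || continue
    gid="$(cat "$port_dir/gids/$idx" 2>/dev/null)"
    case "$gid" in
      0000:0000:0000:0000:0000:ffff:*) echo "$idx"; return 0 ;;
    esac
  done
  return 1
}
```

With these helpers, a script could set `UCX_NET_DEVICES` from the HCA list and assign the same discovered index to `NCCL_IB_GID_INDEX`, `NVSHMEM_IB_GID_INDEX`, and `UCX_IB_GID_INDEX`.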

Volume Mounts

Mount Path	Volume	Description
`/home`	home (emptyDir)	Writable home directory
`/dev/shm`	dshm (emptyDir, Memory)	Shared memory for inter-process communication (1Gi)
`/models`	model-cache (emptyDir)	HuggingFace model cache
`/etc/ssl/certs`	tls-certs (secret)	TLS certificates (read-only)

Usage Examples

# Apply the decode worker configuration
kubectl apply -f config/llmisvcconfig/config-llm-decode-worker-data-parallel.yaml

# Verify the config is created
kubectl get llminferenceserviceconfig kserve-config-llm-decode-worker-data-parallel

Related Pages

Kserve_Kserve_LLM_Prefill_Worker_DP_Config - Companion prefill worker configuration for disaggregated inference
Kserve_Kserve_LLM_Worker_DP_Config - Standard (non-disaggregated) worker configuration
Kserve_Kserve_LLMInferenceServiceConfig_Minimal_CRD - CRD definition for the LLMInferenceServiceConfig resource
Kserve_Kserve_DeepSeek_R1_PD_DeepEP_HT_Sample - Sample that uses decode workers in a PD architecture
