Implementation:Kserve Kserve LLM Worker DP Config

From Leeroopedia
Knowledge Sources
Domains Kubernetes, LLM Inference, Data Parallelism, RDMA
Last Updated 2026-02-13 00:00 GMT

Overview

This file defines the pod template configuration for standard (non-disaggregated) LLM workers with data-parallel execution and automatic RoCE/InfiniBand network discovery.

Description

Specified as an LLMInferenceServiceConfig (v1alpha2), this configuration provides a unified worker template in which a single pod handles both the prefill and decode phases, in contrast to the disaggregated prefill/decode configs. The main container, which runs the llm-d-cuda image, uses the same RoCE auto-detection Bash startup script as those configs. The script discovers active mlx5 HCA devices, selects GID indices suitable for SR-IOV environments, and configures the NCCL, NVSHMEM, and UCX InfiniBand settings before launching multi-GPU data-parallel inference with vllm serve.
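
The HCA discovery step described above can be illustrated with a minimal Bash sketch. It is not the script shipped in the config; it only assumes the standard Linux sysfs layout, where each InfiniBand/RoCE device appears under /sys/class/infiniband and each port exposes a state file containing a value such as "4: ACTIVE". The function name discover_active_hcas and its optional root-path argument are hypothetical and exist only to make the sketch easy to try against a test directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the KServe config): print a
# comma-separated list of mlx5 devices that have at least one ACTIVE port.
# $1 is an optional sysfs root, added here only so the sketch can be tested.
discover_active_hcas() {
  local root="${1:-/sys/class/infiniband}"
  local dev state_file
  local hcas=()
  for dev in "$root"/mlx5_*; do
    # An unmatched glob stays literal, so skip anything that is not a directory.
    [ -d "$dev" ] || continue
    for state_file in "$dev"/ports/*/state; do
      [ -r "$state_file" ] || continue
      # sysfs reports the port state as text, for example "4: ACTIVE".
      if grep -q "ACTIVE" "$state_file"; then
        hcas+=("$(basename "$dev")")
        break  # one active port is enough to keep the device
      fi
    done
  done
  # Join the array with commas.
  local IFS=,
  echo "${hcas[*]}"
}
```

A caller could assign the result to NCCL_IB_HCA when it is non-empty, for example `hcas="$(discover_active_hcas)"; [ -n "$hcas" ] && export NCCL_IB_HCA="$hcas"`. The exact list formats expected by NVSHMEM_HCA_LIST and UCX_NET_DEVICES are not covered by this sketch.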

Usage

Use this configuration as the base template for LLM workers in a unified (non-disaggregated) serving architecture. This is the simpler deployment pattern where each worker handles the full inference pipeline without separating prefill and decode phases.

Code Reference

Source Location

Signature

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-worker-data-parallel
spec:
  template:
    containers:
      - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
        imagePullPolicy: IfNotPresent
        name: main
        ports:
          - containerPort: 8000
            protocol: TCP
        command:
          - "/bin/bash"
          - "-c"
        args:
          - |-
            # Auto-detect RoCE HCAs, configure NCCL/NVSHMEM/UCX, launch vllm serve

Import

kubectl apply -f config/llmisvcconfig/config-llm-worker-data-parallel.yaml

I/O Contract

Main Container: llm-d-cuda

| Component | Description |
| --- | --- |
| Image | ghcr.io/llm-d/llm-d-cuda:v0.4.0 |
| Port | 8000 (TCP) |
| Startup Script | Auto-detects RoCE HCAs; configures NCCL_IB_HCA, NVSHMEM_HCA_LIST, UCX_NET_DEVICES, and GID indices |
| vLLM Launch | vllm serve with data-parallel, tensor-parallel, and expert-parallel flags |

Comparison with Disaggregated Configs

| Aspect | Standard Worker | Decode Worker | Prefill Worker |
| --- | --- | --- | --- |
| Config Name | kserve-config-llm-worker-data-parallel | kserve-config-llm-decode-worker-data-parallel | kserve-config-llm-prefill-worker-data-parallel |
| Config Section | spec.template | spec.template | spec.prefill.template |
| Routing Sidecar | Not included | Included (NIXL v2) | Not included |
| Role | Full inference pipeline | Token generation only | Prompt processing only |

Environment Variables Configured by Startup Script

| Variable | Description |
| --- | --- |
| NCCL_IB_HCA | Comma-separated list of active HCA device names |
| NVSHMEM_HCA_LIST | HCA list for the NVSHMEM library |
| UCX_NET_DEVICES | UCX network devices (HCA:port format) |
| NCCL_IB_GID_INDEX | InfiniBand GID index for RoCE v2 |
| NVSHMEM_IB_GID_INDEX | GID index for NVSHMEM |
| UCX_IB_GID_INDEX | GID index for UCX |
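
As a rough illustration of how a GID index could be chosen and propagated to the three *_GID_INDEX variables above, the sketch below scans the per-port GID table in sysfs. It relies on the standard kernel layout (ports/<port>/gid_attrs/types/<index> holding strings such as "IB/RoCE v1" or "RoCE v2", and ports/<port>/gids/<index> holding the GID). Preferring an IPv4-mapped RoCE v2 entry is an assumed heuristic, not necessarily what the config's startup script does, and the function names find_rocev2_gid_index and export_gid_index are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first GID index (in glob order, not numeric
# order) whose type is "RoCE v2" and whose GID is IPv4-mapped.
# Arguments: device name, port (default 1), sysfs root (default real sysfs).
find_rocev2_gid_index() {
  local dev="$1" port="${2:-1}" root="${3:-/sys/class/infiniband}"
  local base="$root/$dev/ports/$port"
  local type_file idx gid
  for type_file in "$base"/gid_attrs/types/*; do
    [ -r "$type_file" ] || continue
    idx="$(basename "$type_file")"
    # Reading an unpopulated GID slot can fail, so errors are silenced.
    grep -q "RoCE v2" "$type_file" 2>/dev/null || continue
    gid="$(cat "$base/gids/$idx" 2>/dev/null)" || continue
    # Assumed heuristic: accept only IPv4-mapped GIDs (::ffff:a.b.c.d).
    case "$gid" in
      0000:0000:0000:0000:0000:ffff:*) echo "$idx"; return 0 ;;
    esac
  done
  return 1
}

# Export the same index for NCCL, NVSHMEM, and UCX.
export_gid_index() {
  local idx="$1"
  export NCCL_IB_GID_INDEX="$idx"
  export NVSHMEM_IB_GID_INDEX="$idx"
  export UCX_IB_GID_INDEX="$idx"
}
```

A caller might combine the two as `idx="$(find_rocev2_gid_index mlx5_0 1)" && export_gid_index "$idx"`, leaving the variables unset when no suitable entry is found so that each library falls back to its own default.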

vLLM Serve Command Template

| Flag | Source | Description |
| --- | --- | --- |
| --served-model-name | .Spec.Model.Name | Model name from the LLMInferenceService spec |
| --port | hardcoded | 8000 |
| --data-parallel-size | .Spec.Parallelism.Data | Number of data-parallel ranks (default: 1) |
| --data-parallel-size-local | .Spec.Parallelism.DataLocal | Local data-parallel ranks (default: 1) |
| --tensor-parallel-size | .Spec.Parallelism.Tensor | Tensor parallelism degree |
| --enable-expert-parallel | .Spec.Parallelism.Expert | Enable MoE expert parallelism |
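
To show how the flags listed above fit together, the following Bash sketch assembles a vllm serve command line from plain shell arguments that stand in for the templated .Spec values. The function build_vllm_cmd is hypothetical, it prints the command instead of running it, and it omits any positional model-path argument because the table above does not show one. The tensor-parallel default of 1 is an assumption for this sketch; the data-parallel defaults follow the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a vllm serve command built from the flags in
# the table. Arguments: model name, data-parallel size, local data-parallel
# size, tensor-parallel size, and "true" to enable expert parallelism.
build_vllm_cmd() {
  local model_name="$1"
  local dp="${2:-1}"        # default from the table
  local dp_local="${3:-1}"  # default from the table
  local tp="${4:-1}"        # assumed default for this sketch
  local expert="${5:-false}"
  local cmd=(vllm serve
    --served-model-name "$model_name"
    --port 8000
    --data-parallel-size "$dp"
    --data-parallel-size-local "$dp_local"
    --tensor-parallel-size "$tp")
  # The expert-parallel flag is a boolean switch, so it is only appended.
  if [ "$expert" = "true" ]; then
    cmd+=(--enable-expert-parallel)
  fi
  echo "${cmd[*]}"
}
```

For example, `build_vllm_cmd my-model 4 2 2 true` prints a single line that starts with `vllm serve --served-model-name my-model --port 8000` and ends with `--enable-expert-parallel`.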

Usage Examples

# Apply the standard worker configuration
kubectl apply -f config/llmisvcconfig/config-llm-worker-data-parallel.yaml

# Verify the config is created
kubectl get llminferenceserviceconfig kserve-config-llm-worker-data-parallel
