Implementation:Kserve Kserve LLM Decode Worker DP Config

From Leeroopedia
Knowledge Sources
Domains: Kubernetes, LLM Inference, Data Parallelism, RDMA
Last Updated: 2026-02-13 00:00 GMT

Overview

This file defines the pod template configuration for LLM decode workers with data-parallel execution and NIXL v2 connector support for high-performance GPU communication.

Description

Specified as an LLMInferenceServiceConfig (v1alpha2), this configuration defines two containers. The llm-d-routing-sidecar init container uses the NIXL v2 connector for KV cache transfer; because it sets restartPolicy: Always, it runs as a sidecar alongside the main container instead of exiting before the main container starts. The main llm-d-cuda container runs vLLM from an embedded bash startup script. Before launching multi-GPU data-parallel inference via vllm serve, the script auto-detects RoCE (RDMA over Converged Ethernet) HCA devices, discovers active mlx5 interfaces, selects a GID index suitable for SR-IOV environments, and configures the InfiniBand settings for NCCL, NVSHMEM, and UCX.
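The detection steps described above can be sketched as follows. This is a minimal illustration, not the script shipped in the configuration: the function names `list_active_hcas` and `find_rocev2_gid_index`, the optional sysfs-root argument, and the default of port 1 are all assumptions made here. The GID heuristic (lowest index that is RoCE v2 with an IPv4-mapped address) is one common choice and may differ from what the real script does.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the sysfs-root parameter are hypothetical.

# Print a comma-separated list of mlx5 devices whose port is ACTIVE.
# Usage: list_active_hcas [sysfs_root] [port]
list_active_hcas() (
  root="${1:-/sys/class/infiniband}"
  port="${2:-1}"
  names=""
  for dev in "$root"/mlx5_*; do
    # Port state files read like "4: ACTIVE" or "1: DOWN".
    if grep -q 'ACTIVE' "$dev/ports/$port/state" 2>/dev/null; then
      names="${names:+$names,}${dev##*/}"
    fi
  done
  echo "$names"
)

# Print the lowest GID index that is RoCE v2 with an IPv4-mapped address.
# Usage: find_rocev2_gid_index DEVICE [port] [sysfs_root]
find_rocev2_gid_index() (
  dev="$1"
  port="${2:-1}"
  root="${3:-/sys/class/infiniband}"
  base="$root/$dev/ports/$port"
  # Sort numerically so that index 3 is tried before index 10.
  for idx in $(ls "$base/gid_attrs/types" 2>/dev/null | sort -n); do
    # Unpopulated GID slots can fail to read; skip them.
    gid_type="$(cat "$base/gid_attrs/types/$idx" 2>/dev/null)" || continue
    gid="$(cat "$base/gids/$idx" 2>/dev/null)" || continue
    case "$gid_type:$gid" in
      "RoCE v2:0000:0000:0000:0000:0000:ffff:"*)
        echo "$idx"
        return 0
        ;;
    esac
  done
  return 1
)
```

On a node with RDMA hardware, calling `list_active_hcas` with no arguments reads the standard `/sys/class/infiniband` tree. Passing a different root directory makes the functions easy to exercise against a fake tree.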

Usage

Use this configuration as the base template for decode workers in a disaggregated prefill-decode LLM serving architecture. It is referenced by LLMInferenceService resources that require dedicated decode pools with data-parallel execution and RDMA networking.

Code Reference

Source Location

Signature

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-decode-worker-data-parallel
spec:
  template:
    initContainers:
      - name: llm-d-routing-sidecar
        image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
        restartPolicy: Always
        ports:
          - containerPort: 8000
        args:
          - "--port=8000"
          - "--vllm-port=8001"
          - "--connector=nixlv2"
          - "--secure-proxy=false"
    containers:
      - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
        name: main
        ports:
          - containerPort: 8001
        command:
          - "/bin/bash"
          - "-c"
        args:
          - |-
            # Auto-detect RoCE HCAs, configure NCCL/NVSHMEM/UCX, launch vllm serve

Import

kubectl apply -f config/llmisvcconfig/config-llm-decode-worker-data-parallel.yaml

I/O Contract

Init Container: llm-d-routing-sidecar

| Parameter | Value | Description |
|---|---|---|
| --port | 8000 | External-facing proxy port |
| --vllm-port | 8001 | Internal vLLM engine port |
| --connector | nixlv2 | KV cache transfer connector type |
| --secure-proxy | false | TLS disabled (BackendTLSPolicy not yet implemented) |

Main Container: llm-d-cuda

| Component | Description |
|---|---|
| Image | ghcr.io/llm-d/llm-d-cuda:v0.4.0 |
| Port | 8001 (TCP) |
| Startup Script | Auto-detects RoCE HCAs; configures NCCL_IB_HCA, NVSHMEM_HCA_LIST, UCX_NET_DEVICES, and the GID indices |
| vLLM Launch | vllm serve with data-parallel, tensor-parallel, and expert-parallel flags |
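To make the vLLM launch row concrete, the sketch below assembles a `vllm serve` command line that combines the three kinds of parallelism. The helper name `build_vllm_cmd`, the model name, and the parallel sizes are hypothetical. Only the flags `--port`, `--data-parallel-size`, `--tensor-parallel-size`, and `--enable-expert-parallel` are taken from vLLM's CLI, and the actual startup script may pass additional or different options.

```shell
#!/usr/bin/env bash
# Sketch only: build_vllm_cmd and its arguments are hypothetical.

# Print a vllm serve command line for the given model and parallel sizes.
# Usage: build_vllm_cmd MODEL DP_SIZE TP_SIZE [PORT]
build_vllm_cmd() {
  # The port defaults to 8001, the main container port in this config.
  echo "vllm serve $1 --port ${4:-8001}" \
    "--data-parallel-size $2 --tensor-parallel-size $3 --enable-expert-parallel"
}
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without GPUs; a real launcher would typically end with `exec vllm serve ...` so that vLLM becomes the container's main process.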

Environment Variables Configured by Startup Script

| Variable | Description |
|---|---|
| NCCL_IB_HCA | Comma-separated list of active HCA device names |
| NVSHMEM_HCA_LIST | HCA list for NVSHMEM library |
| UCX_NET_DEVICES | UCX network devices (HCA:port format) |
| NCCL_IB_GID_INDEX | InfiniBand GID index for RoCE v2 |
| NVSHMEM_IB_GID_INDEX | GID index for NVSHMEM |
| UCX_IB_GID_INDEX | GID index for UCX |
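The variables listed above could be derived from a detected HCA list and GID index roughly as follows. This is a sketch under stated assumptions: the helper name `export_rdma_env` is hypothetical, port 1 is assumed for every device, and giving NVSHMEM_HCA_LIST the same HCA:port entries as UCX_NET_DEVICES is an assumption based on the NVSHMEM documentation, not something the source confirms.

```shell
#!/usr/bin/env bash
# Sketch only: export_rdma_env is a hypothetical helper.

# Export the RDMA variables for NCCL, NVSHMEM, and UCX.
# Usage: export_rdma_env "mlx5_0,mlx5_1" GID_INDEX [PORT]
export_rdma_env() {
  [ -n "$1" ] || return 1          # refuse an empty HCA list
  _hcas="$1"
  _gid="$2"
  _port="${3:-1}"                  # assumption: every HCA uses port 1
  # Append ":<port>" to every device name to get the HCA:port format.
  _with_port="$(echo "$_hcas" | sed "s/,/:$_port,/g; s/\$/:$_port/")"

  export NCCL_IB_HCA="$_hcas"
  export NVSHMEM_HCA_LIST="$_with_port"   # assumed HCA:port entries
  export UCX_NET_DEVICES="$_with_port"
  export NCCL_IB_GID_INDEX="$_gid"
  export NVSHMEM_IB_GID_INDEX="$_gid"
  export UCX_IB_GID_INDEX="$_gid"
}
```

For example, `export_rdma_env "mlx5_0,mlx5_1" 3` sets NCCL_IB_HCA to `mlx5_0,mlx5_1` and UCX_NET_DEVICES to `mlx5_0:1,mlx5_1:1`, with all three GID index variables set to 3.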

Volume Mounts

| Mount Path | Volume | Description |
|---|---|---|
| /home | home (emptyDir) | Writable home directory |
| /dev/shm | dshm (emptyDir, Memory) | Shared memory for inter-process communication (1Gi) |
| /models | model-cache (emptyDir) | HuggingFace model cache |
| /etc/ssl/certs | tls-certs (secret) | TLS certificates (read-only) |

Usage Examples

# Apply the decode worker configuration
kubectl apply -f config/llmisvcconfig/config-llm-decode-worker-data-parallel.yaml

# Verify the config is created
kubectl get llminferenceserviceconfig kserve-config-llm-decode-worker-data-parallel
