Implementation:Kserve Kserve LLM Decode Template

Knowledge Sources	Kserve_Kserve KServe Docs
Domains	Kubernetes, LLM Serving
Last Updated	2026-02-13 00:00 GMT

Overview

A concrete LLMInferenceServiceConfig template, provided by the KServe project, for LLM decode (token generation) workers.

Description

This file defines the default pod template for LLM decode workers in the disaggregated prefill-decode serving architecture, in which decode (token generation) runs separately from prefill (prompt processing). It specifies an LLMInferenceServiceConfig with two containers: a vLLM-based main container (ghcr.io/llm-d/llm-d-cuda:v0.4.0) that serves the model on port 8001, and an llm-d-routing-sidecar that listens on port 8000 and forwards requests to vLLM using the NIXL v2 connector (--connector=nixlv2). The sidecar is declared under initContainers with restartPolicy: Always, so Kubernetes runs it as a native sidecar for the lifetime of the pod rather than as a one-shot init step. The template uses Go template syntax to inject the model name and to derive the TLS secret name, and it includes volumes and mounts for shared memory, the model cache, and TLS certificates.
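The volume mounts mentioned above do not appear in the excerpt under Signature, so the fragment below is only a sketch of how the main container might consume the four declared volumes. The volume names (home, dshm, model-cache, tls-certs) come from the template; every mountPath is an assumption chosen for illustration and should be checked against the actual file.

```yaml
# Sketch only: mount paths are assumptions, not copied from the template.
containers:
  - name: main
    volumeMounts:
      - name: home            # writable scratch home directory (emptyDir)
        mountPath: /home
      - name: dshm            # memory-backed emptyDir, capped at 1Gi by the template
        mountPath: /dev/shm
      - name: model-cache     # scratch space for cached model artifacts (emptyDir)
        mountPath: /models
      - name: tls-certs       # secret named by the ChildName template expression
        mountPath: /var/run/kserve/tls
        readOnly: true
```

Mounting a memory-backed emptyDir at /dev/shm is a common way to give inference processes more shared memory than the container runtime's default allows.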

Usage

The LLMInferenceService controller consumes this configuration as the default template for decode worker pods. It is applied to the cluster as part of the LLM serving configuration and is referenced when creating LLMInferenceService resources that use the disaggregated prefill-decode architecture.
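For orientation, a resource that would lead the controller to use this decode template might look roughly like the following. This is a hedged sketch, not taken from the source: the apiVersion is assumed to match the config shown under Signature, the resource name and model location are hypothetical, and the field names (spec.model.uri, spec.model.name, spec.prefill) should be verified against the CRD installed in the cluster.

```yaml
# Hypothetical resource; verify field names against the installed CRD.
apiVersion: serving.kserve.io/v1alpha2    # assumed to match the config's API version
kind: LLMInferenceService
metadata:
  name: example-llm                       # hypothetical name
spec:
  model:
    uri: hf://example-org/example-model   # hypothetical model location
    name: example-org/example-model       # available to the template as .Spec.Model.Name
  replicas: 1                             # decode workers
  prefill:                                # assumed: this block selects disaggregated serving
    replicas: 1
```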

Code Reference

Source Location

Repository: Kserve_Kserve
File: config/llmisvcconfig/config-llm-decode-template.yaml
Lines: 1-145

Signature

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-decode-template
spec:
  template:
    containers:
      - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
        name: main
        ports:
          - containerPort: 8001
            protocol: TCP
        command:
          - vllm
          - serve
          - /mnt/models
        args:
          - --served-model-name
          - "{{ .Spec.Model.Name }}"
          - --port
          - "8001"
    initContainers:
      - name: llm-d-routing-sidecar
        image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
        restartPolicy: Always
        ports:
          - containerPort: 8000
            protocol: TCP
        args:
          - "--port=8000"
          - "--vllm-port=8001"
          - "--connector=nixlv2"
          - "--secure-proxy=false"
    volumes:
      - emptyDir: {}
        name: home
      - emptyDir:
          medium: Memory
          sizeLimit: 1Gi
        name: dshm
      - emptyDir: {}
        name: model-cache
      - name: tls-certs
        secret:
          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"

Import

kubectl apply -f config/llmisvcconfig/config-llm-decode-template.yaml

I/O Contract

Inputs

Name	Type	Required	Description
.Spec.Model.Name	Go template variable	Yes	Model name injected dynamically at reconciliation time
.ObjectMeta.Name	Go template variable	Yes	Object name used to derive the TLS secret name
INFERENCE_POOL_NAMESPACE	env (fieldRef)	Yes	Namespace of the inference pool, injected into the routing sidecar
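The INFERENCE_POOL_NAMESPACE entry is not visible in the excerpt under Signature. Assuming it is populated from the pod's own namespace through the Kubernetes Downward API, the sidecar's env stanza would look something like this; the fieldPath value is an assumption inferred from the description in the table.

```yaml
# Sketch only: fieldPath is an assumption based on the variable's description.
initContainers:
  - name: llm-d-routing-sidecar
    env:
      - name: INFERENCE_POOL_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace   # Downward API: namespace the pod runs in
```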

Outputs

Name	Type	Description
LLMInferenceServiceConfig	Custom Resource	Decode worker template configuration consumed by the LLMInferenceService controller
vLLM decode container	Container (port 8001)	Serves the model for token generation (decode phase)
Routing sidecar	Sidecar init container (port 8000)	Routes requests between prefill and decode workers via the NIXL v2 connector

Usage Examples

Apply the decode template

kubectl apply -f config/llmisvcconfig/config-llm-decode-template.yaml

Verify the config is present

kubectl get llminferenceserviceconfig kserve-config-llm-decode-template
