Implementation:Kserve Kserve LLM Decode Template
| Knowledge Sources | |
|---|---|
| Domains | Kubernetes, LLM Serving |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A concrete LLMInferenceServiceConfig template, provided by the KServe project, for LLM decode (token generation) workers.
Description
This file defines the default pod template for LLM decode workers in the disaggregated prefill-decode serving architecture, in which decode (token generation) runs separately from prefill (prompt processing). The LLMInferenceServiceConfig specifies a vLLM-based main container (llm-d-cuda:v0.4.0) that serves on port 8001, and an llm-d-routing-sidecar that listens on port 8000, is pointed at vLLM through --vllm-port=8001, and uses the NIXL v2 connector (--connector=nixlv2). The sidecar is declared as an init container with restartPolicy: Always, so it keeps running alongside the main container instead of exiting before it starts. Go template expressions inject the model name and derive the TLS secret name. The template also includes volumes for shared memory, a model cache, and TLS certificates.
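
The port relationship described above (clients reach the sidecar, which in turn reaches vLLM) can be illustrated with a minimal sketch. This is not the llm-d-routing-sidecar implementation, which also coordinates prefill and decode workers; it only shows a front listener relaying to a backend, using Go's standard reverse proxy. The helper names `newRelay` and `fetchThroughRelay`, the stand-in backend, and the request path are assumptions made for illustration, and the test servers use ephemeral ports rather than 8000 and 8001.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newRelay returns a handler that forwards every request to backendURL.
// Hypothetical helper: it mirrors the idea of --vllm-port pointing the
// sidecar at the vLLM container, nothing more.
func newRelay(backendURL string) (http.Handler, error) {
	target, err := url.Parse(backendURL)
	if err != nil {
		return nil, err
	}
	return httputil.NewSingleHostReverseProxy(target), nil
}

// fetchThroughRelay starts a stand-in backend and a relay in front of it,
// then returns the body a client receives when it calls the relay.
func fetchThroughRelay(path string) (string, error) {
	// Stand-in for the vLLM container (port 8001 in the template).
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "backend saw %s", r.URL.Path)
	}))
	defer backend.Close()

	relay, err := newRelay(backend.URL)
	if err != nil {
		return "", err
	}
	// Stand-in for the sidecar listener (port 8000 in the template).
	front := httptest.NewServer(relay)
	defer front.Close()

	resp, err := http.Get(front.URL + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetchThroughRelay("/v1/completions")
	if err != nil {
		panic(err)
	}
	fmt.Println(body) // prints "backend saw /v1/completions"
}
```

Keeping the client-facing port on the sidecar means the routing logic can change without touching the vLLM command line, which is presumably why the template exposes 8000 for routing and keeps 8001 for the model server.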
Usage
The LLMInferenceService controller consumes this configuration as the default template for decode worker pods. It is applied to the cluster as part of the LLM serving configuration and is referenced when creating LLMInferenceService resources that use the disaggregated prefill-decode architecture.
Code Reference
Source Location
- Repository: Kserve_Kserve
- File: config/llmisvcconfig/config-llm-decode-template.yaml
- Lines: 1-145
Signature
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-decode-template
spec:
  template:
    containers:
      - image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
        name: main
        ports:
          - containerPort: 8001
            protocol: TCP
        command:
          - vllm
          - serve
          - /mnt/models
        args:
          - --served-model-name
          - "{{ .Spec.Model.Name }}"
          - --port
          - "8001"
    initContainers:
      - name: llm-d-routing-sidecar
        image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
        restartPolicy: Always
        ports:
          - containerPort: 8000
            protocol: TCP
        args:
          - "--port=8000"
          - "--vllm-port=8001"
          - "--connector=nixlv2"
          - "--secure-proxy=false"
    volumes:
      - emptyDir: {}
        name: home
      - emptyDir:
          medium: Memory
          sizeLimit: 1Gi
        name: dshm
      - emptyDir: {}
        name: model-cache
      - name: tls-certs
        secret:
          secretName: "{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}"
Import
kubectl apply -f config/llmisvcconfig/config-llm-decode-template.yaml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| .Spec.Model.Name | Go template variable | Yes | Model name injected dynamically at reconciliation time |
| .ObjectMeta.Name | Go template variable | Yes | Object name used to derive the TLS secret name |
| INFERENCE_POOL_NAMESPACE | env (fieldRef) | Yes | Namespace of the inference pool, injected into the routing sidecar |
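
To make the two Go template inputs concrete, here is a minimal sketch of how such placeholders could be expanded with Go's standard `text/template` package. It is not the controller's actual rendering code. The `Service` struct and its nested types are trimmed-down stand-ins for the real resource, the sample names are invented, and `childName` is an assumed substitute for the `ChildName` helper that simply appends the suffix; the real helper may behave differently, for example when names are long.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Hypothetical stand-ins for the object handed to the template.
// Only the fields referenced by the placeholders are modelled.
type Model struct{ Name string }
type Spec struct{ Model Model }
type ObjectMeta struct{ Name string }
type Service struct {
	ObjectMeta ObjectMeta
	Spec       Spec
}

// childName is an assumed stand-in for ChildName: plain concatenation.
func childName(parent, suffix string) string {
	return parent + suffix
}

// render expands a single templated field against svc.
func render(field string, svc Service) (string, error) {
	tmpl, err := template.New("field").
		Funcs(template.FuncMap{"ChildName": childName}). // register before Parse
		Parse(field)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, svc); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	svc := Service{
		ObjectMeta: ObjectMeta{Name: "demo-llm"},
		Spec:       Spec{Model: Model{Name: "example-model"}},
	}
	fields := []string{
		"{{ .Spec.Model.Name }}",
		"{{ ChildName .ObjectMeta.Name `-kserve-self-signed-certs` }}",
	}
	for _, field := range fields {
		got, err := render(field, svc)
		if err != nil {
			panic(err)
		}
		fmt.Println(got)
	}
	// Prints:
	// example-model
	// demo-llm-kserve-self-signed-certs
}
```

Under these assumptions, the `--served-model-name` argument would resolve to the model name of the owning resource, and the `tls-certs` volume would point at a secret named after that resource.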
Outputs
| Name | Type | Description |
|---|---|---|
| LLMInferenceServiceConfig | Custom Resource | Decode worker template configuration consumed by the LLMInferenceService controller |
| vLLM decode container | Container (port 8001) | Serves the model for token generation (decode phase) |
| Routing sidecar | Init container with restartPolicy: Always (port 8000) | Routes requests between prefill and decode workers via the NIXL v2 connector |
Usage Examples
Apply the decode template
kubectl apply -f config/llmisvcconfig/config-llm-decode-template.yaml
Verify the config is present
kubectl get llminferenceserviceconfig kserve-config-llm-decode-template