Principle:Kserve Kserve LLM Worker Configuration
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Distributed_Computing, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A pod templating pattern that defines how large language model inference workers are configured, scaled, and differentiated into prefill, decode, and general-purpose roles.
Description
LLM Worker Configuration governs the pod templates used to deploy vLLM inference workers within the KServe LLMInferenceService subsystem. The LLMInferenceServiceConfig custom resource stores cluster-wide default templates for three worker roles:
- Decode workers -- handle the autoregressive token generation phase, optimized for low-latency sequential output.
- Prefill workers -- handle the prompt processing (prefill) phase, optimized for high-throughput parallel computation.
- General workers -- handle both phases in a non-disaggregated deployment.
Each template specifies container images, GPU resource requests, environment variables for vLLM configuration, and data-parallel (DP) scaling parameters. When a user creates an LLMInferenceService, the controller merges the user spec with the appropriate template from LLMInferenceServiceConfig to produce the final pod specifications.
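As a rough illustration of the merge step, the sketch below combines a role template with a user override using the per-field rules listed under Theoretical Basis (overrides replace the image and GPU requests, vLLM args are appended, environment variables merge with the user taking precedence). It is not the KServe controller's code: the flat dictionary layout, the `merge_worker_template` helper, and the example image name are all hypothetical, and a real pod template nests these fields under its container list.

```python
from copy import deepcopy


def merge_worker_template(template: dict, user_spec: dict) -> dict:
    """Merge a user override into a role template (illustrative only).

    The flat keys "image", "resources", "args" and "env" are an assumed,
    simplified stand-in for the corresponding pod template fields.
    """
    merged = deepcopy(template)

    # Image and GPU requests: a user value replaces the template default.
    for key in ("image", "resources"):
        if key in user_spec:
            merged[key] = deepcopy(user_spec[key])

    # vLLM args: template defaults first, then user additions.
    merged["args"] = list(template.get("args", [])) + list(user_spec.get("args", []))

    # Environment variables: union of both, user wins on conflicting names.
    merged["env"] = {**template.get("env", {}), **user_spec.get("env", {})}
    return merged


# Hypothetical decode template and user override.
decode_template = {
    "image": "example.invalid/vllm:default",
    "resources": {"nvidia.com/gpu": 1},
    "args": ["--max-model-len=4096"],
    "env": {"VLLM_LOGGING_LEVEL": "INFO"},
}
user_override = {
    "resources": {"nvidia.com/gpu": 4},
    "args": ["--tensor-parallel-size=4"],
    "env": {"VLLM_LOGGING_LEVEL": "DEBUG"},
}

final_spec = merge_worker_template(decode_template, user_override)
print(final_spec)
```

Here the image is inherited from the template because the user did not override it, while the GPU request and the logging level come from the user. Working on a deep copy leaves the cluster-wide template unchanged, which matters when many services share the same defaults.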
Usage
Use this principle when:
- Configuring GPU types and counts for LLM inference workers
- Tuning vLLM parameters (tensor parallelism, max model length, KV cache size)
- Setting up disaggregated prefill/decode serving architecture
- Scaling LLM workers with data-parallel replication
Theoretical Basis
# LLM worker template merging flow (NOT implementation code)

LLMInferenceServiceConfig stores default templates:
    decodeSpec:               # Default pod template for decode workers
    prefillSpec:              # Default pod template for prefill workers
    workerSpec:               # Default pod template for general workers
    decodeDataParallelSpec:   # DP scaling config for decode
    prefillDataParallelSpec:  # DP scaling config for prefill
    workerDataParallelSpec:   # DP scaling config for general

Template merging algorithm:
    1. User creates LLMInferenceService with optional overrides
    2. Controller loads matching template from LLMInferenceServiceConfig
    3. User-specified fields override template defaults
    4. Final pod spec computed:
        - Container image (from template or user override)
        - GPU resource requests (from template or user override)
        - vLLM args (merged: template defaults + user additions)
        - Environment variables (merged with user taking precedence)

Data-parallel scaling:
    - DP config specifies replica count per model
    - Each DP replica is a full copy of the model on its own GPU set
    - Load balancer distributes requests across DP replicas
    - Tensor parallelism operates within a single DP replica
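The relationship between data parallelism and tensor parallelism can be made concrete with a little arithmetic: if every DP replica is a full model copy sharded across its own GPUs, the total GPU count is the product of the two sizes. The sketch below shows that calculation together with a simple request assignment. The `ParallelismPlan` class and `assign_round_robin` function are hypothetical names, and round-robin is only an assumed policy; the source does not say which strategy the load balancer uses.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class ParallelismPlan:
    """Hypothetical summary of one worker role's parallelism settings."""

    data_parallel_replicas: int  # number of full model copies
    tensor_parallel_size: int    # GPUs that jointly hold one copy

    @property
    def total_gpus(self) -> int:
        # Each replica owns its own GPU set, so the sizes multiply.
        return self.data_parallel_replicas * self.tensor_parallel_size


def assign_round_robin(request_ids, replicas: int) -> dict:
    """Map each request to a DP replica index (assumed round-robin policy)."""
    replica_cycle = cycle(range(replicas))
    return {request_id: next(replica_cycle) for request_id in request_ids}


plan = ParallelismPlan(data_parallel_replicas=2, tensor_parallel_size=4)
assignment = assign_round_robin(
    ["r0", "r1", "r2", "r3", "r4"], plan.data_parallel_replicas
)

print(plan.total_gpus)  # 8
print(assignment)       # {'r0': 0, 'r1': 1, 'r2': 0, 'r3': 1, 'r4': 0}
```

Each request is served entirely by one replica; tensor parallelism only determines how that replica splits the model across its four GPUs, so raising the DP replica count adds throughput while raising the tensor-parallel size mainly lets a larger model fit.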
Related Pages
Implemented By
- Implementation:Kserve_Kserve_LLM_Decode_Template
- Implementation:Kserve_Kserve_LLM_Prefill_Template
- Implementation:Kserve_Kserve_LLM_Worker_Template
- Implementation:Kserve_Kserve_LLM_Decode_Worker_DP_Config
- Implementation:Kserve_Kserve_LLM_Prefill_Worker_DP_Config
- Implementation:Kserve_Kserve_LLM_Worker_DP_Config