Principle:Kserve Kserve LLM Worker Configuration

From Leeroopedia
Knowledge Sources
Domains LLM_Serving, Distributed_Computing, Kubernetes
Last Updated 2026-02-13 00:00 GMT

Overview

A pod templating pattern that defines how large language model inference workers are configured, scaled, and differentiated into prefill, decode, and general-purpose roles.

Description

LLM Worker Configuration governs the pod templates used to deploy vLLM inference workers within the KServe LLMInferenceService subsystem. The LLMInferenceServiceConfig custom resource stores cluster-wide default templates for three worker roles:

  • Decode workers -- handle the autoregressive token generation phase, optimized for low-latency sequential output.
  • Prefill workers -- handle the prompt processing (prefill) phase, optimized for high-throughput parallel computation.
  • General workers -- handle both phases in a non-disaggregated deployment.

Each template specifies container images, GPU resource requests, environment variables for vLLM configuration, and data-parallel (DP) scaling parameters. When a user creates an LLMInferenceService, the controller merges the user spec with the appropriate template from LLMInferenceServiceConfig to produce the final pod specifications.
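The merge described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the KServe controller's code: the flat dictionary schema (`image`, `gpus`, `args`, `env`) and the helper names `merge_worker_spec` and `build_pod_spec` are hypothetical, and only the template keys `decodeSpec`, `prefillSpec`, and `workerSpec` come from this page.

```python
from copy import deepcopy

# Role -> template key in LLMInferenceServiceConfig (key names from this page).
ROLE_TO_TEMPLATE = {
    "decode": "decodeSpec",
    "prefill": "prefillSpec",
    "general": "workerSpec",
}


def merge_worker_spec(template: dict, override: dict) -> dict:
    """Merge a user override into a template (hypothetical flat schema)."""
    merged = deepcopy(template)  # never mutate the cluster-wide default
    # Scalar fields: the user value replaces the template value when present.
    for key in ("image", "gpus"):
        if key in override:
            merged[key] = override[key]
    # vLLM args: template defaults first, then user additions.
    merged["args"] = list(template.get("args", [])) + list(override.get("args", []))
    # Environment variables: merged, with the user taking precedence.
    merged["env"] = {**template.get("env", {}), **override.get("env", {})}
    return merged


def build_pod_spec(config: dict, role: str, override: dict) -> dict:
    """Pick the template for the given role and apply the user override."""
    return merge_worker_spec(config[ROLE_TO_TEMPLATE[role]], override)


# Example with made-up values (image name and env var are placeholders).
config = {
    "decodeSpec": {
        "image": "example.invalid/vllm:template",
        "gpus": 1,
        "args": ["--max-model-len", "4096"],
        "env": {"LOG_LEVEL": "info"},
    },
}
final = build_pod_spec(
    config,
    "decode",
    {"gpus": 2, "args": ["--tensor-parallel-size", "2"], "env": {"LOG_LEVEL": "debug"}},
)
```

In this sketch the image is inherited from the template, the GPU count and `LOG_LEVEL` come from the user, and the argument list is the template defaults followed by the user additions.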

Usage

Use this principle when:

  • Configuring GPU types and counts for LLM inference workers
  • Tuning vLLM parameters (tensor parallelism, max model length, KV cache size)
  • Setting up disaggregated prefill/decode serving architecture
  • Scaling LLM workers with data-parallel replication
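As a rough illustration of the tuning items in this list, the sketch below builds a vLLM argument list and a matching GPU resource request. The helper names are hypothetical, and where these values are placed inside an LLMInferenceServiceConfig template is not specified on this page. The flags `--tensor-parallel-size`, `--max-model-len`, and `--gpu-memory-utilization` are standard vLLM options; the last one influences KV cache capacity only indirectly, by bounding how much GPU memory vLLM may use. `nvidia.com/gpu` is the usual Kubernetes resource name for NVIDIA GPUs and is assumed here.

```python
def vllm_args(tensor_parallel_size: int, max_model_len: int,
              gpu_memory_utilization: float) -> list[str]:
    """Build a vLLM CLI argument list (hypothetical helper)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    if max_model_len < 1:
        raise ValueError("max_model_len must be >= 1")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


def gpu_request(tensor_parallel_size: int) -> dict:
    """One GPU per tensor-parallel shard (assumes NVIDIA device plugin)."""
    return {"nvidia.com/gpu": str(tensor_parallel_size)}


args = vllm_args(tensor_parallel_size=2, max_model_len=8192,
                 gpu_memory_utilization=0.9)
resources = gpu_request(2)
```

Keeping the GPU request derived from the tensor-parallel size avoids the two drifting apart when a template is overridden.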

Theoretical Basis

# LLM worker template merging flow (NOT implementation code)
LLMInferenceServiceConfig stores default templates:
  decodeSpec:    # Default pod template for decode workers
  prefillSpec:   # Default pod template for prefill workers
  workerSpec:    # Default pod template for general workers
  decodeDataParallelSpec:   # DP scaling config for decode
  prefillDataParallelSpec:  # DP scaling config for prefill
  workerDataParallelSpec:   # DP scaling config for general

Template merging algorithm:
  1. User creates LLMInferenceService with optional overrides
  2. Controller loads matching template from LLMInferenceServiceConfig
  3. User-specified fields override template defaults
  4. Final pod spec computed:
     - Container image (from template or user override)
     - GPU resource requests (from template or user override)
     - vLLM args (merged: template defaults + user additions)
     - Environment variables (merged with user taking precedence)

Data-parallel scaling:
  - DP config specifies replica count per model
  - Each DP replica is a full copy of the model on its own GPU set
  - Load balancer distributes requests across DP replicas
  - Tensor parallelism operates within a single DP replica
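
The data-parallel points above imply simple arithmetic: if each DP replica holds a full model copy sharded across its own tensor-parallel group, the GPU total is the product of the two sizes. The sketch below shows that calculation and one possible request distribution. It assumes no pipeline parallelism, and the round-robin policy is an illustrative choice; this page does not say which policy the load balancer uses. All function names are hypothetical.

```python
from itertools import cycle


def total_gpus(dp_replicas: int, tp_size: int) -> int:
    """GPUs needed when each DP replica spans tp_size GPUs."""
    return dp_replicas * tp_size


def gpu_groups(dp_replicas: int, tp_size: int) -> list[list[int]]:
    """Assign a disjoint block of GPU indices to each DP replica."""
    return [
        list(range(r * tp_size, (r + 1) * tp_size))
        for r in range(dp_replicas)
    ]


def round_robin(requests: list[str], dp_replicas: int) -> dict[int, list[str]]:
    """Distribute requests across replicas in turn (assumed policy)."""
    assignment = {r: [] for r in range(dp_replicas)}
    for request, replica in zip(requests, cycle(range(dp_replicas))):
        assignment[replica].append(request)
    return assignment


needed = total_gpus(dp_replicas=4, tp_size=2)
groups = gpu_groups(dp_replicas=2, tp_size=2)
routed = round_robin(["a", "b", "c"], dp_replicas=2)
```

With 4 DP replicas and a tensor-parallel size of 2, the sketch reports 8 GPUs, and each replica's GPU group is disjoint from the others.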

