# Principle: KServe Pool Tuning
| Knowledge Sources | |
|---|---|
| Domains | Performance_Tuning, LLM_Serving, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
## Overview
A configuration management pattern for independently tuning prefill and decode pool pod templates, resource allocations, and sidecar configurations.
## Description
Pool Tuning adjusts the operational characteristics of the prefill and decode pools through `LLMInferenceServiceConfig` templates:
- Prefill template: vLLM serves directly on port 8000, optimized for throughput with batched prompt processing.
- Decode template: vLLM listens on port 8001 (internal) behind a routing sidecar on port 8000 (public). The sidecar (`llm-d-routing-sidecar`) handles the NixlConnector v2 protocol for KV cache reception.
- Independent scaling: `spec.replicas` controls the decode pool size; `spec.prefill.replicas` controls the prefill pool size. Resource allocation (GPU count, memory) can differ between pools.
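A minimal sketch of how these dimensions might appear in an `LLMInferenceService` manifest. Only `spec.replicas` and `spec.prefill.replicas` come from the description above; the `apiVersion`, metadata, and comments are illustrative assumptions about the API shape, not verified fields:

```yaml
# Illustrative sketch only -- field names other than spec.replicas and
# spec.prefill.replicas are assumptions about the resource schema.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-disagg
spec:
  replicas: 4          # decode pool size (vLLM on 8001 behind the
                       # llm-d-routing-sidecar on public port 8000)
  prefill:
    replicas: 2        # prefill pool size (vLLM direct on port 8000)
```

Because the two replica counts are separate fields, each pool can be scaled (and given different GPU/memory requests) without touching the other.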
## Usage
Adjust pool sizes and resources based on production traffic patterns. Monitor KV cache utilization, queue depth, and latency metrics to guide tuning decisions.
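As one way to watch the metrics named above, a Prometheus alerting-rule fragment. `vllm:gpu_cache_usage_perc` (a 0-1 gauge) and `vllm:num_requests_waiting` are standard vLLM Prometheus metrics, but the `pool` label and the thresholds here are assumptions for illustration:

```yaml
# Illustrative monitoring sketch -- the pool label and thresholds are
# assumed; adjust selectors to match your actual deployment labels.
groups:
  - name: llm-pool-tuning
    rules:
      - alert: DecodeKVCacheSaturated
        expr: avg(vllm:gpu_cache_usage_perc{pool="decode"}) > 0.9
        for: 5m    # sustained KV cache pressure: scale decode replicas
      - alert: PrefillQueueBacklog
        expr: sum(vllm:num_requests_waiting{pool="prefill"}) > 10
        for: 5m    # sustained queue depth: scale prefill replicas
```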
## Theoretical Basis
```text
# Pool tuning dimensions (NOT implementation code)
Prefill pool:
- replicas: scale for prompt throughput
- GPU memory: must fit model + KV cache for batch processing
- vLLM args: batch size, speculative decoding

Decode pool:
- replicas: scale for concurrent generation streams
- GPU memory: must fit model + KV cache for active sequences
- Routing sidecar: NixlConnector v2 for KV cache reception
- vLLM args: max_num_seqs, max_model_len

Scorer weights:
- prefix-cache-scorer: 3 (favor KV reuse)
- kv-cache-utilization-scorer: 2 (avoid overloaded endpoints)
- queue-scorer: 2 (balance request queues)
```
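The scorer weights listed above could be expressed in an endpoint-picker plugin configuration along these lines. The scorer names and weights are taken from the list; the surrounding schema (`apiVersion`, `kind`, the `schedulingProfiles` layout) is an assumption modeled on the Gateway API inference extension's `EndpointPickerConfig`, so verify it against your deployed version:

```yaml
# Illustrative sketch -- schema details are assumed; weights from above.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: prefix-cache-scorer
  - type: kv-cache-utilization-scorer
  - type: queue-scorer
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 3    # favor endpoints holding reusable KV prefixes
      - pluginRef: kv-cache-utilization-scorer
        weight: 2    # steer away from overloaded endpoints
      - pluginRef: queue-scorer
        weight: 2    # balance request queues across the pool
```

Raising the prefix-cache weight relative to the other two biases routing toward KV reuse at the cost of less even load spreading.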