# Principle: KServe Pool Tuning
| Knowledge Sources | |
|---|---|
| Domains | Performance_Tuning, LLM_Serving, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
## Overview
A configuration management pattern for independently tuning prefill and decode pool pod templates, resource allocations, and sidecar configurations.
## Description
Pool Tuning adjusts the operational characteristics of the prefill and decode pools through `LLMInferenceServiceConfig` templates:
- Prefill template: vLLM serves directly on port 8000, optimized for throughput with batched prompt processing.
- Decode template: vLLM listens on port 8001 (internal) behind a routing sidecar on port 8000 (public). The sidecar (`llm-d-routing-sidecar`) handles the NixlConnector v2 protocol for KV cache reception.
- Independent scaling: `spec.replicas` controls the decode pool size; `spec.prefill.replicas` controls the prefill pool size. Resource allocation (GPU count, memory) can differ between pools.
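A minimal sketch of how these dimensions might appear in an `LLMInferenceService` manifest. Only `spec.replicas` and `spec.prefill.replicas` come from the description above; the `apiVersion`, metadata, and comments are illustrative assumptions about the API shape, not verified fields:

```yaml
# Illustrative sketch only -- field names other than spec.replicas and
# spec.prefill.replicas are assumptions about the resource schema.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-disagg
spec:
  replicas: 4          # decode pool size (vLLM on 8001 behind the
                       # llm-d-routing-sidecar on public port 8000)
  prefill:
    replicas: 2        # prefill pool size (vLLM direct on port 8000)
```

Because the two replica counts are separate fields, each pool can be scaled (and given different GPU/memory requests) without touching the other.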
## Usage
Adjust pool sizes and resources based on production traffic patterns. Monitor KV cache utilization, queue depth, and latency metrics to guide tuning decisions.
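As one way to watch the metrics named above, a Prometheus alerting-rule fragment. `vllm:gpu_cache_usage_perc` (a 0-1 gauge) and `vllm:num_requests_waiting` are standard vLLM Prometheus metrics, but the `pool` label and the thresholds here are assumptions for illustration:

```yaml
# Illustrative monitoring sketch -- the pool label and thresholds are
# assumed; adjust selectors to match your actual deployment labels.
groups:
  - name: llm-pool-tuning
    rules:
      - alert: DecodeKVCacheSaturated
        expr: avg(vllm:gpu_cache_usage_perc{pool="decode"}) > 0.9
        for: 5m    # sustained KV cache pressure: scale decode replicas
      - alert: PrefillQueueBacklog
        expr: sum(vllm:num_requests_waiting{pool="prefill"}) > 10
        for: 5m    # sustained queue depth: scale prefill replicas
```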
## Theoretical Basis
```text
# Pool tuning dimensions (NOT implementation code)
Prefill pool:
- replicas: scale for prompt throughput
- GPU memory: must fit model + KV cache for batch processing
- vLLM args: batch size, speculative decoding

Decode pool:
- replicas: scale for concurrent generation streams
- GPU memory: must fit model + KV cache for active sequences
- Routing sidecar: NixlConnector v2 for KV cache reception
- vLLM args: max_num_seqs, max_model_len

Scorer weights:
- prefix-cache-scorer: 3 (favor KV reuse)
- kv-cache-utilization-scorer: 2 (avoid overloaded endpoints)
- queue-scorer: 2 (balance request queues)
```
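The scorer weights listed above could be expressed in an endpoint-picker plugin configuration along these lines. The scorer names and weights are taken from the list; the surrounding schema (`apiVersion`, `kind`, the `schedulingProfiles` layout) is an assumption modeled on the Gateway API inference extension's `EndpointPickerConfig`, so verify it against your deployed version:

```yaml
# Illustrative sketch -- schema details are assumed; weights from above.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: prefix-cache-scorer
  - type: kv-cache-utilization-scorer
  - type: queue-scorer
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 3    # favor endpoints holding reusable KV prefixes
      - pluginRef: kv-cache-utilization-scorer
        weight: 2    # steer away from overloaded endpoints
      - pluginRef: queue-scorer
        weight: 2    # balance request queues across the pool
```

Raising the prefix-cache weight relative to the other two biases routing toward KV reuse at the cost of less even load spreading.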