# Principle: KServe Prefill-Decode Specification
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Distributed_Systems, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
## Overview
A disaggregated inference architecture that separates the prefill (prompt processing) and decode (token generation) phases onto independent GPU pools for optimized throughput and latency.
## Description
Prefill-Decode Specification enables a separation of concerns in LLM serving:
- Prefill pool: Processes the input prompt, computing the KV cache. This is compute-bound and benefits from high GPU utilization.
- Decode pool: Generates output tokens autoregressively using the transferred KV cache. This is memory-bound and latency-sensitive.
By separating these phases, each pool can be independently scaled and optimized. The KV cache is transferred between pools using NixlConnector over RDMA for minimal latency.
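As a conceptual illustration only (this is not the KServe or NixlConnector API, and real systems compute attention KV tensors on GPU and ship them over RDMA), the handoff between the two pools can be sketched as two independent stages exchanging a KV cache:

```python
from dataclasses import dataclass

# Toy stand-in for the per-layer key/value tensors a real engine produces.
@dataclass
class KVCache:
    prompt_len: int
    blocks: list

def prefill(prompt_tokens: list) -> KVCache:
    # Compute-bound phase: process the whole prompt in one batch,
    # producing the KV cache the decode pool will attend over.
    blocks = [hash((i, t)) % 997 for i, t in enumerate(prompt_tokens)]
    return KVCache(prompt_len=len(prompt_tokens), blocks=blocks)

def decode(kv: KVCache, max_new_tokens: int) -> list:
    # Memory-bound phase: generate one token at a time, appending each
    # new token's KV entry to the transferred cache.
    out = []
    for step in range(max_new_tokens):
        nxt = (sum(kv.blocks) + step) % 50_000  # fake "sampled" token id
        kv.blocks.append(nxt % 997)
        out.append(nxt)
    return out

kv = prefill([101, 2023, 2003, 1037, 3231, 102])  # runs in the prefill pool
tokens = decode(kv, max_new_tokens=4)             # runs in the decode pool
print(len(tokens))  # 4
```

The point of the sketch is the interface: `decode` needs only the transferred `KVCache`, never the prefill pool's compute, which is what lets the two pools scale independently.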
## Usage
Use disaggregated PD serving when:
- Prompt processing latency is not critical but token generation latency is
- Prefill and decode have different scaling patterns
- The model fits in GPU memory on single nodes
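The criteria above can be captured as a simple decision helper. The field names and the ratio threshold below are illustrative assumptions, not part of any KServe API:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    itl_sensitive: bool          # inter-token latency matters (e.g. chat streaming)
    ttft_sensitive: bool         # time-to-first-token is a hard requirement
    prefill_decode_ratio: float  # avg prompt tokens / avg generated tokens
    fits_on_single_node: bool    # model weights fit in one node's GPU memory

def should_disaggregate(w: Workload) -> bool:
    # Mirrors the usage guidance: favor PD separation when token-generation
    # latency dominates, prefill/decode scaling patterns diverge, and the
    # model fits on single nodes. The 4x ratio cutoff is an assumption.
    if not w.fits_on_single_node:
        return False
    scaling_mismatch = (w.prefill_decode_ratio > 4
                        or w.prefill_decode_ratio < 0.25)
    return w.itl_sensitive and not w.ttft_sensitive and scaling_mismatch

# Long prompts, short streamed answers: a typical candidate workload.
chat = Workload(itl_sensitive=True, ttft_sensitive=False,
                prefill_decode_ratio=8.0, fits_on_single_node=True)
print(should_disaggregate(chat))  # True
```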
## Theoretical Basis
```
# Prefill-Decode separation (NOT implementation code)

Standard LLM inference:
  [Prompt] → [Prefill: compute KV cache] → [Decode: generate tokens]
  Single pool handles both phases sequentially

Disaggregated PD:
  [Prompt] → [Prefill Pool: compute KV cache]
                  ↓ KV transfer (RDMA/NixlConnector)
             [Decode Pool: generate tokens using transferred KV]
```
Benefits:
- Prefill pool optimized for throughput (batched prompts)
- Decode pool optimized for latency (fast per-token generation)
- Independent scaling: `prefill_replicas != decode_replicas`
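The independent-scaling point can be made concrete with back-of-the-envelope sizing. The per-GPU throughput figures below are invented for illustration; real numbers depend on model, hardware, and batching:

```python
import math

def size_pools(req_per_s: float, prompt_toks: int, gen_toks: int,
               prefill_toks_per_s_gpu: float, decode_toks_per_s_gpu: float):
    # Each phase is sized against its own bottleneck, which is why the
    # replica counts need not match (prefill_replicas != decode_replicas).
    prefill_load = req_per_s * prompt_toks  # compute-bound token rate
    decode_load = req_per_s * gen_toks      # memory-bound token rate
    return (math.ceil(prefill_load / prefill_toks_per_s_gpu),
            math.ceil(decode_load / decode_toks_per_s_gpu))

# Assumed figures: long prompts, short answers -> prefill-heavy workload.
prefill_gpus, decode_gpus = size_pools(
    req_per_s=10, prompt_toks=2048, gen_toks=256,
    prefill_toks_per_s_gpu=8000, decode_toks_per_s_gpu=1500)
print(prefill_gpus, decode_gpus)  # 3 2
```

A colocated pool would have to overprovision for the larger of the two demands on every replica; sizing each pool separately avoids that.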