# Principle: KServe Prefill-Decode Specification
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Distributed_Systems, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
## Overview
A disaggregated inference architecture that separates the prefill (prompt processing) and decode (token generation) phases onto independent GPU pools for optimized throughput and latency.
## Description
Prefill-Decode Specification enables a separation of concerns in LLM serving:
- Prefill pool: Processes the input prompt, computing the KV cache. This is compute-bound and benefits from high GPU utilization.
- Decode pool: Generates output tokens autoregressively using the transferred KV cache. This is memory-bound and latency-sensitive.
By separating these phases, each pool can be independently scaled and optimized. The KV cache is transferred between pools using NixlConnector over RDMA for minimal latency.
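As a conceptual illustration only (this is not the KServe or NixlConnector API, and real systems compute attention KV tensors on GPU and ship them over RDMA), the handoff between the two pools can be sketched as two independent stages exchanging a KV cache:

```python
from dataclasses import dataclass

# Toy stand-in for the per-layer key/value tensors a real engine produces.
@dataclass
class KVCache:
    prompt_len: int
    blocks: list

def prefill(prompt_tokens: list) -> KVCache:
    # Compute-bound phase: process the whole prompt in one batch,
    # producing the KV cache the decode pool will attend over.
    blocks = [hash((i, t)) % 997 for i, t in enumerate(prompt_tokens)]
    return KVCache(prompt_len=len(prompt_tokens), blocks=blocks)

def decode(kv: KVCache, max_new_tokens: int) -> list:
    # Memory-bound phase: generate one token at a time, appending each
    # new token's KV entry to the transferred cache.
    out = []
    for step in range(max_new_tokens):
        nxt = (sum(kv.blocks) + step) % 50_000  # fake "sampled" token id
        kv.blocks.append(nxt % 997)
        out.append(nxt)
    return out

kv = prefill([101, 2023, 2003, 1037, 3231, 102])  # runs in the prefill pool
tokens = decode(kv, max_new_tokens=4)             # runs in the decode pool
print(len(tokens))  # 4
```

The point of the sketch is the interface: `decode` needs only the transferred `KVCache`, never the prefill pool's compute, which is what lets the two pools scale independently.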
## Usage
Use disaggregated PD serving when:
- Prompt processing latency is not critical but token generation latency is
- Prefill and decode have different scaling patterns
- The model fits in GPU memory on single nodes
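The criteria above can be captured as a simple decision helper. The field names and the ratio threshold below are illustrative assumptions, not part of any KServe API:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    itl_sensitive: bool          # inter-token latency matters (e.g. chat streaming)
    ttft_sensitive: bool         # time-to-first-token is a hard requirement
    prefill_decode_ratio: float  # avg prompt tokens / avg generated tokens
    fits_on_single_node: bool    # model weights fit in one node's GPU memory

def should_disaggregate(w: Workload) -> bool:
    # Mirrors the usage guidance: favor PD separation when token-generation
    # latency dominates, prefill/decode scaling patterns diverge, and the
    # model fits on single nodes. The 4x ratio cutoff is an assumption.
    if not w.fits_on_single_node:
        return False
    scaling_mismatch = (w.prefill_decode_ratio > 4
                        or w.prefill_decode_ratio < 0.25)
    return w.itl_sensitive and not w.ttft_sensitive and scaling_mismatch

# Long prompts, short streamed answers: a typical candidate workload.
chat = Workload(itl_sensitive=True, ttft_sensitive=False,
                prefill_decode_ratio=8.0, fits_on_single_node=True)
print(should_disaggregate(chat))  # True
```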
## Theoretical Basis
```
# Prefill-Decode separation (NOT implementation code)

Standard LLM inference:
  [Prompt] → [Prefill: compute KV cache] → [Decode: generate tokens]
  Single pool handles both phases sequentially

Disaggregated PD:
  [Prompt] → [Prefill Pool: compute KV cache]
                  ↓ KV transfer (RDMA/NixlConnector)
             [Decode Pool: generate tokens using transferred KV]
```
Benefits:
- Prefill pool optimized for throughput (batched prompts)
- Decode pool optimized for latency (fast per-token generation)
- Independent scaling: `prefill_replicas != decode_replicas`
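The independent-scaling point can be made concrete with back-of-the-envelope sizing. The per-GPU throughput figures below are invented for illustration; real numbers depend on model, hardware, and batching:

```python
import math

def size_pools(req_per_s: float, prompt_toks: int, gen_toks: int,
               prefill_toks_per_s_gpu: float, decode_toks_per_s_gpu: float):
    # Each phase is sized against its own bottleneck, which is why the
    # replica counts need not match (prefill_replicas != decode_replicas).
    prefill_load = req_per_s * prompt_toks  # compute-bound token rate
    decode_load = req_per_s * gen_toks      # memory-bound token rate
    return (math.ceil(prefill_load / prefill_toks_per_s_gpu),
            math.ceil(decode_load / decode_toks_per_s_gpu))

# Assumed figures: long prompts, short answers -> prefill-heavy workload.
prefill_gpus, decode_gpus = size_pools(
    req_per_s=10, prompt_toks=2048, gen_toks=256,
    prefill_toks_per_s_gpu=8000, decode_toks_per_s_gpu=1500)
print(prefill_gpus, decode_gpus)  # 3 2
```

A colocated pool would have to overprovision for the larger of the two demands on every replica; sizing each pool separately avoids that.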