Principle:Kserve Kserve Gateway Inference Extension
| Knowledge Sources | |
|---|---|
| Domains | Networking, LLM_Serving, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
An extension to the Kubernetes Gateway API that introduces inference-aware resource types for intelligent request routing to large language model serving pools.
Description
The Gateway Inference Extension defines two custom resource types, InferenceObjective and InferencePool, that extend the standard Kubernetes Gateway API to support LLM-specific routing semantics. Unlike traditional HTTP routing, which selects backends based on host/path matching alone, inference-aware routing also considers model-level attributes such as which model is loaded, request priority, and backend health metrics.
An InferenceObjective declares the desired routing policy for a model (e.g., target pool, priority level), while an InferencePool defines the set of backend pods serving a particular model, identified by a label selector and a target port. The Gateway controller and the KServe scheduler work together: the HTTPRoute directs traffic to a pool, and the scheduler within the pool (the endpoint picker) selects an endpoint based on queue depth, KV cache utilization, and request characteristics.
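To make the endpoint selection idea concrete, the following is a minimal Python sketch of a load-aware picker. It is not the KServe scheduler or the extension's actual algorithm: the `Endpoint` type, the `score` function, the equal weights, and the `max_queue` normalization constant are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical snapshot of one backend pod's load metrics."""
    name: str
    queue_depth: int             # requests currently waiting on this pod
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0


def score(ep: Endpoint, max_queue: int = 32,
          queue_weight: float = 0.5, kv_weight: float = 0.5) -> float:
    """Lower is better. The weights and max_queue are illustrative assumptions."""
    queue_load = min(ep.queue_depth / max_queue, 1.0)
    return queue_weight * queue_load + kv_weight * ep.kv_cache_utilization


def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """Return the endpoint with the lowest combined load score."""
    if not endpoints:
        raise ValueError("pool has no ready endpoints")
    return min(endpoints, key=score)


pool = [
    Endpoint("pod-a", queue_depth=8, kv_cache_utilization=0.90),
    Endpoint("pod-b", queue_depth=12, kv_cache_utilization=0.30),
    Endpoint("pod-c", queue_depth=2, kv_cache_utilization=0.75),
]
chosen = pick_endpoint(pool)
print(chosen.name)
```

With these sample values the picker does not choose the pod with the shortest queue, because that pod's KV cache is close to full. Combining several signals is the point of inference-aware selection; the specific weighting here is only one possible choice.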
Usage
Use this principle when:
- Deploying LLM inference services that require model-aware load balancing
- Integrating KServe with Gateway API instead of Istio VirtualService
- Configuring priority-based request routing for multi-tenant LLM deployments
- Building disaggregated prefill/decode architectures that need intelligent request steering
Theoretical Basis
# Gateway inference extension routing flow (NOT implementation code)
Resource hierarchy:
  Gateway → HTTPRoute → InferencePool → Backend Pods
                              ↑
                     InferenceObjective (policy attachment)
InferenceObjective:
  spec:
    poolRef: reference to an InferencePool
    priority: routing priority level
InferencePool:
  spec:
    targetPortNumber: port on backend pods
    selector: label selector for backend pods
    endpointPickerConfig: scheduler configuration
Request flow:
  1. Client sends request to Gateway
  2. HTTPRoute matches host/path and forwards to InferencePool backend
  3. InferencePool's endpoint picker (scheduler) selects optimal pod
  4. Selection criteria: queue depth, KV cache usage, model loaded
  5. Request forwarded to selected pod
  6. Response returned through the Gateway
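Steps 2 through 5 of the request flow can be traced with a small in-memory simulation. This is a sketch under stated assumptions, not the controller's implementation: the `Pod`, `InferencePool`, and `HTTPRoute` classes below are simplified stand-ins for the Kubernetes resources, and the host name, labels, model name, and tie-breaking order are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    labels: dict[str, str]
    loaded_models: set[str]
    queue_depth: int
    kv_cache_usage: float  # fraction in use, 0.0 to 1.0


@dataclass
class InferencePool:
    name: str
    selector: dict[str, str]  # label selector for backend pods
    target_port: int          # mirrors targetPortNumber

    def members(self, pods: list[Pod]) -> list[Pod]:
        """Pods whose labels contain every selector key/value pair."""
        return [p for p in pods
                if all(p.labels.get(k) == v for k, v in self.selector.items())]


@dataclass
class HTTPRoute:
    host: str
    path_prefix: str
    pool: InferencePool

    def matches(self, host: str, path: str) -> bool:
        return host == self.host and path.startswith(self.path_prefix)


def route_request(routes, pods, host, path, model):
    """Return 'pod:port' for the selected backend, or None if nothing fits."""
    # Step 2: the HTTPRoute matches host/path and yields the backing pool.
    route = next((r for r in routes if r.matches(host, path)), None)
    if route is None:
        return None
    # Step 3: the endpoint picker only considers pods selected by the pool.
    candidates = route.pool.members(pods)
    # Step 4: keep pods with the model loaded, then rank by load
    # (queue depth first, KV cache usage as tie-breaker; an assumption).
    eligible = [p for p in candidates if model in p.loaded_models]
    if not eligible:
        return None
    best = min(eligible, key=lambda p: (p.queue_depth, p.kv_cache_usage))
    # Step 5: the request is forwarded to the selected pod's target port.
    return f"{best.name}:{route.pool.target_port}"


pool = InferencePool("llama-pool", selector={"app": "llama"}, target_port=8000)
routes = [HTTPRoute("llm.example.com", "/v1", pool)]
pods = [
    Pod("llama-0", {"app": "llama"}, {"llama-3-8b"}, queue_depth=4, kv_cache_usage=0.6),
    Pod("llama-1", {"app": "llama"}, {"llama-3-8b"}, queue_depth=1, kv_cache_usage=0.8),
    Pod("llama-2", {"app": "llama"}, set(), queue_depth=0, kv_cache_usage=0.0),
    Pod("embed-0", {"app": "embed"}, {"llama-3-8b"}, queue_depth=0, kv_cache_usage=0.1),
]
target = route_request(routes, pods, "llm.example.com", "/v1/completions", "llama-3-8b")
print(target)
```

In this example `embed-0` is excluded by the pool's label selector and `llama-2` is excluded because it has no model loaded, so the choice falls between `llama-0` and `llama-1` on load alone. A real endpoint picker may weigh these signals differently or fall back to pods that can load the model on demand.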