Principle:Kserve Kserve Gateway Inference Extension

Knowledge Sources	Kserve_Kserve, KServe Docs, Gateway API, Gateway API Inference Extension
Domains	Networking, LLM_Serving, Kubernetes
Last Updated	2026-02-13 00:00 GMT

Overview

The Gateway Inference Extension is an extension to the Kubernetes Gateway API that adds inference-aware resource types, so that requests can be routed intelligently to pools of large language model (LLM) servers.

Description

The Gateway Inference Extension defines two custom resource types, InferenceObjective and InferencePool, that extend the standard Kubernetes Gateway API with LLM-specific routing semantics. Traditional HTTP routing selects backends by host/path matching alone. Inference-aware routing also considers model-level attributes: which model is loaded on a backend, the priority of the request, and backend health metrics.

An InferenceObjective declares the desired routing policy for a model (e.g., target pool, priority level). An InferencePool defines the set of backend pods that serve a particular model, chosen by a label selector, together with the port they listen on and the scheduler configuration. The Gateway controller and the KServe scheduler work together: the HTTPRoute directs traffic to a pool, and the scheduler within the pool selects an endpoint based on queue depth, KV cache utilization, and request characteristics.
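The relationship between the two resources can be illustrated with a small in-memory model. This is a sketch only, not the CRD schema or KServe code: the class names mirror the resource kinds, while the field names (`pool_ref`, `target_port`), the helper functions, and the sample pod names and labels are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    labels: dict[str, str]


@dataclass
class InferencePool:
    name: str
    selector: dict[str, str]  # label selector for backend pods
    target_port: int          # port on backend pods


@dataclass
class InferenceObjective:
    name: str
    pool_ref: str             # name of the InferencePool the policy attaches to
    priority: int             # routing priority level


def pool_members(pool, pods):
    """Return the pods whose labels contain every key/value pair of the selector."""
    return [
        p for p in pods
        if all(p.labels.get(k) == v for k, v in pool.selector.items())
    ]


def resolve_objective(objective, pools):
    """Return the pool an objective refers to, or None if the reference dangles."""
    return next((p for p in pools if p.name == objective.pool_ref), None)


# Hypothetical sample data.
pods = [
    Pod("llama-0", {"app": "llama"}),
    Pod("llama-1", {"app": "llama"}),
    Pod("embed-0", {"app": "embedder"}),
]
pool = InferencePool("llama-pool", {"app": "llama"}, 8000)
objective = InferenceObjective("chat-critical", "llama-pool", 10)

target = resolve_objective(objective, [pool])
members = [p.name for p in pool_members(target, pods)]
print(members)  # ['llama-0', 'llama-1']
```

The objective carries only a reference and a priority; pool membership is derived from pod labels, so pods can join or leave a pool without any change to the policy.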

Usage

Use this principle when:

Deploying LLM inference services that require model-aware load balancing
Integrating KServe with Gateway API instead of Istio VirtualService
Configuring priority-based request routing for multi-tenant LLM deployments
Building disaggregated prefill/decode architectures that need intelligent request steering

Theoretical Basis

# Gateway inference extension routing flow (NOT implementation code)
Resource hierarchy:
  Gateway → HTTPRoute → InferencePool → Backend Pods
                              ↑
                      InferenceObjective (policy attachment via poolRef)

InferenceObjective:
  spec:
    poolRef: reference to an InferencePool
    priority: routing priority level

InferencePool:
  spec:
    targetPortNumber: port on backend pods
    selector: label selector for backend pods
    endpointPickerConfig: scheduler configuration

Request flow:
  1. Client sends request to Gateway
  2. HTTPRoute matches host/path and forwards to InferencePool backend
  3. InferencePool's endpoint picker (scheduler) selects the optimal pod,
     using queue depth, KV cache usage, and whether the model is loaded
  4. Request forwarded to selected pod
  5. Response returned through the Gateway
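The endpoint selection step of the flow above can be sketched as a filter followed by a score. This is an illustrative assumption, not the scheduler's actual algorithm: the `Endpoint` fields, the two weights, and the fallback to all endpoints when no pod has the model loaded are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    queue_depth: int          # requests waiting on this pod
    kv_cache_usage: float     # fraction of KV cache in use, 0.0 to 1.0
    loaded_models: frozenset  # names of models currently loaded


# Hypothetical weights: one queued request counts as much as
# 10 percentage points of KV cache utilization.
QUEUE_WEIGHT = 1.0
KV_WEIGHT = 10.0


def score(endpoint):
    """Lower is better: combines queue depth and KV cache pressure."""
    return QUEUE_WEIGHT * endpoint.queue_depth + KV_WEIGHT * endpoint.kv_cache_usage


def pick_endpoint(endpoints, model):
    """Prefer pods that already have the model loaded, then take the lowest score."""
    if not endpoints:
        raise ValueError("pool has no endpoints")
    candidates = [e for e in endpoints if model in e.loaded_models] or list(endpoints)
    return min(candidates, key=score)


# Hypothetical pool state.
endpoints = [
    Endpoint("pod-a", queue_depth=4, kv_cache_usage=0.2, loaded_models=frozenset({"llama"})),
    Endpoint("pod-b", queue_depth=1, kv_cache_usage=0.9, loaded_models=frozenset({"llama"})),
    Endpoint("pod-c", queue_depth=0, kv_cache_usage=0.0, loaded_models=frozenset({"mistral"})),
]

chosen = pick_endpoint(endpoints, "llama")
print(chosen.name)  # pod-a (score 6.0 beats pod-b's 10.0; pod-c lacks the model)
```

Filtering on the loaded model before scoring keeps an idle pod without the model from winning over a busier pod that can serve the request immediately.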

Related Pages

Implemented By

Implementation:Kserve_Kserve_Gateway_Inference_Extension_CRDs

Related Principles
