Principle:Kserve Kserve LLMInferenceService Specification
Appearance
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Kubernetes, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A purpose-built CRD specification for deploying large language models as OpenAI-compatible inference endpoints with GPU scheduling, worker management, and intelligent request routing.
Description
The LLMInferenceService Specification extends KServe's serving capabilities specifically for LLMs. Unlike the general InferenceService, it provides:
- Model spec: Direct
hf://orpvc://URI for model artifacts. - Workload spec: Replicas, GPU resource requests, pod templates for vLLM containers.
- Worker spec: Optional multi-node worker pods for tensor/data/expert parallelism.
- Router spec: Scheduler, route, and gateway configuration for intelligent request routing.
- Prefill spec: Optional disaggregated prefill pool for KV cache separation.
Usage
Use this instead of InferenceService when deploying LLMs that need:
- GPU scheduling
- OpenAI-compatible API endpoints
- Intelligent request routing (prefix cache, load-aware)
- Multi-node distributed inference
- Disaggregated prefill-decode serving
Theoretical Basis
# LLMInferenceService spec model (NOT implementation code)
LLMInferenceService:
spec:
model:
uri: "hf://Qwen/Qwen2.5-7B-Instruct"
name: "Qwen2.5-7B"
replicas: 3 # Decode pool replicas
template: # Pod template with GPU resources
containers:
- resources:
limits:
nvidia.com/gpu: "1"
worker: # Optional: multi-node workers
replicas: 4
router: # Request routing
scheduler: {} # Endpoint picker (prefix cache, load-aware)
route: {} # HTTPRoute configuration
gateway: {} # Gateway binding
prefill: # Optional: disaggregated PD
replicas: 2
Related Pages
Implemented By
Page Connections
Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment