Implementation:Kserve Kserve LLMInferenceService CRD Spec
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Kubernetes, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete Go type definitions for the LLMInferenceService CRD, which supports GPU-accelerated LLM serving with vLLM.
Description
The LLMInferenceService CRD is defined in pkg/apis/serving/v1alpha1/llm_inference_service_types.go. It is served in two API versions, v1alpha1 and v1alpha2 (the conversion hub), with conversion webhooks translating between them. The spec includes the model URI, the workload configuration (replicas and pod template), optional worker and prefill pools, and the router configuration.
Usage
Write an LLMInferenceService YAML manifest to deploy an LLM. From it, the controller creates the vLLM pods, an InferencePool, an HTTPRoute, and the scheduler resources.
Code Reference
Source Location
- Repository: kserve
- File: pkg/apis/serving/v1alpha1/llm_inference_service_types.go, Lines 44-107
- File: pkg/apis/serving/v1alpha2/llm_inference_service_types.go, Lines 43-97 (hub version)
Signature
// LLMInferenceService serves LLMs with GPU acceleration
type LLMInferenceService struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   LLMInferenceServiceSpec   `json:"spec,omitempty"`
	Status LLMInferenceServiceStatus `json:"status,omitempty"`
}

// LLMInferenceServiceSpec defines the desired LLM serving state
type LLMInferenceServiceSpec struct {
	Model LLMModelSpec `json:"model"`

	WorkloadSpec `json:",inline"`

	Worker  *WorkerSpec   `json:"worker,omitempty"`
	Router  *RouterSpec   `json:"router,omitempty"`
	Prefill *WorkloadSpec `json:"prefill,omitempty"`
}

// LLMModelSpec defines the model to serve
type LLMModelSpec struct {
	URI  string `json:"uri"`
	Name string `json:"name,omitempty"`
}

// WorkloadSpec defines replicas and pod template
type WorkloadSpec struct {
	Replicas *int32                 `json:"replicas,omitempty"`
	Template corev1.PodTemplateSpec `json:"template,omitempty"`
}
Import
import servingv1alpha1 "github.com/kserve/kserve/pkg/apis/serving/v1alpha1"
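To make the mapping between the Go struct tags and the manifest fields concrete, the following is a minimal sketch that marshals a spec to JSON. It uses simplified local copies of the types from the Signature section rather than the real KServe package, so Kubernetes metadata, the pod template, and the worker and router fields are left out. The helper name renderSpec is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified local mirrors of the types in the Signature section.
// Assumption: the pod template, worker, and router fields are omitted
// so the sketch has no dependency on Kubernetes or KServe packages.
type LLMModelSpec struct {
	URI  string `json:"uri"`
	Name string `json:"name,omitempty"`
}

type WorkloadSpec struct {
	Replicas *int32 `json:"replicas,omitempty"`
}

type LLMInferenceServiceSpec struct {
	Model LLMModelSpec `json:"model"`

	// Embedded without a JSON name, so its fields are promoted to the
	// top level of the spec (replicas appears directly under spec).
	WorkloadSpec `json:",inline"`

	Prefill *WorkloadSpec `json:"prefill,omitempty"`
}

// renderSpec is a hypothetical helper that builds a spec and returns
// its JSON encoding.
func renderSpec(uri string, replicas int32) (string, error) {
	spec := LLMInferenceServiceSpec{
		Model:        LLMModelSpec{URI: uri},
		WorkloadSpec: WorkloadSpec{Replicas: &replicas},
	}
	out, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := renderSpec("hf://Qwen/Qwen2.5-7B-Instruct", 3)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out)
}
```

Because WorkloadSpec is embedded, replicas is serialized next to model instead of under a nested key, and the unset optional fields (name, prefill) are dropped by omitempty. This matches the layout of the manifests in the Usage Examples section, where replicas and template sit directly under spec.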
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| spec.model.uri | string | Yes | Model URI: hf://org/model or pvc://pvc-name |
| spec.model.name | string | No | Model display name (defaults from URI) |
| spec.replicas | *int32 | No | Number of decode pool replicas |
| spec.template | PodTemplateSpec | No (`omitempty` in the Go type) | Pod spec with GPU resources and vLLM container |
| spec.worker | *WorkerSpec | No | Multi-node worker configuration |
| spec.router | *RouterSpec | No | Scheduler, route, gateway config |
| spec.prefill | *WorkloadSpec | No | Disaggregated prefill pool config |
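The URI forms listed for spec.model.uri can be checked before a manifest is submitted. The sketch below is a hypothetical client-side helper, not part of KServe; it assumes that only the two prefixes named in the table (hf:// and pvc://) are of interest, and the real controller may accept other schemes.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedSchemes lists the URI prefixes named in the Inputs table.
// Assumption: the real controller may accept more schemes than these.
var supportedSchemes = []string{"hf://", "pvc://"}

// validateModelURI is a hypothetical helper. It returns nil when the URI
// starts with a listed prefix and has a non-empty remainder.
func validateModelURI(uri string) error {
	for _, prefix := range supportedSchemes {
		if strings.HasPrefix(uri, prefix) && len(uri) > len(prefix) {
			return nil
		}
	}
	return fmt.Errorf("unsupported model URI %q: expected one of %v", uri, supportedSchemes)
}

func main() {
	candidates := []string{
		"hf://Qwen/Qwen2.5-7B-Instruct",
		"pvc://model-store",
		"model.bin", // no scheme, rejected by this sketch
	}
	for _, uri := range candidates {
		if err := validateModelURI(uri); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("ok:", uri)
	}
}
```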
Outputs
| Name | Type | Description |
|---|---|---|
| vLLM Pods | Pods | Running vLLM model serving containers |
| InferencePool | inference.networking.x-k8s.io | Endpoint pool for scheduler |
| HTTPRoute | gateway.networking.k8s.io | Route to the model endpoint |
| Scheduler | Deployment | Endpoint picker for intelligent routing |
| status.conditions | []Condition | Readiness conditions (LLMInferenceServiceReady) |
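A client that waits for a deployment typically scans status.conditions for a readiness entry. The following is a minimal sketch of that lookup. The Condition struct is a simplified stand-in for the Kubernetes-style condition type, the helper isReady is hypothetical, and treating "LLMInferenceServiceReady" as the literal condition type string is an assumption based on the Outputs table.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes-style status
// condition; the real type carries more fields (timestamps, message).
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// isReady is a hypothetical helper. It reports whether the first
// condition of the given type has Status "True". A missing condition
// is treated as not ready.
func isReady(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	// Assumption: the condition type string matches the name in the
	// Outputs table; the reason value here is invented for illustration.
	conds := []Condition{
		{Type: "LLMInferenceServiceReady", Status: "False", Reason: "PodsNotReady"},
	}
	fmt.Println("ready:", isReady(conds, "LLMInferenceServiceReady"))

	conds[0].Status = "True"
	fmt.Println("ready:", isReady(conds, "LLMInferenceServiceReady"))
}
```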
Usage Examples
Single-Node GPU Deployment
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen2-7b
spec:
  model:
    uri: "hf://Qwen/Qwen2.5-7B-Instruct"
  replicas: 3
  template:
    spec:
      containers:
        - name: main
          resources:
            limits:
              cpu: "4"
              memory: 32Gi
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTPS
            initialDelaySeconds: 120
            periodSeconds: 30
  router:
    scheduler: {}
    route: {}
    gateway: {}
CPU-Only Deployment
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: opt-125m-cpu
spec:
  model:
    uri: "hf://facebook/opt-125m"
  replicas: 1
  template:
    spec:
      containers:
        - name: main
          resources:
            limits:
              cpu: "4"
              memory: 8Gi