Implementation:Kserve Kserve LLMInferenceService CRD Spec

Knowledge Sources	KServe KServe LLM Serving
Domains	LLM_Serving, Kubernetes, GPU_Computing
Last Updated	2026-02-13 00:00 GMT

Overview

Concrete Go type definitions for the LLMInferenceService CRD supporting GPU-accelerated LLM serving with vLLM.

Description

The LLMInferenceService CRD is defined in pkg/apis/serving/v1alpha1/llm_inference_service_types.go. It is served in two API versions, v1alpha1 and v1alpha2, with v1alpha2 acting as the conversion hub and conversion webhooks translating between the two. The spec includes the model URI, the workload configuration (replicas and pod template), optional worker and prefill pools, and the router configuration.
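
The hub-and-spoke conversion mentioned above can be sketched as follows. This is a minimal illustration, not the KServe code: the `Hub` and `Convertible` interfaces are local stand-ins shaped like the ones in controller-runtime's conversion package, and `HubSpec`, `SpokeSpec`, and `copyInt32` are hypothetical, heavily reduced types and helpers invented for the example. The actual field mapping between v1alpha1 and v1alpha2 is not shown in this page.

```go
package main

import (
	"errors"
	"fmt"
)

// Hub marks the version that all other versions convert through.
// Local stand-in shaped like controller-runtime's conversion.Hub.
type Hub interface {
	Hub()
}

// Convertible is implemented by every non-hub (spoke) version.
// Local stand-in shaped like controller-runtime's conversion.Convertible.
type Convertible interface {
	ConvertTo(dst Hub) error
	ConvertFrom(src Hub) error
}

// HubSpec is a hypothetical, reduced stand-in for the v1alpha2 type.
type HubSpec struct {
	ModelURI string
	Replicas *int32
}

// Hub marks HubSpec as the conversion hub.
func (*HubSpec) Hub() {}

// SpokeSpec is a hypothetical, reduced stand-in for the v1alpha1 type.
type SpokeSpec struct {
	ModelURI string
	Replicas *int32
}

// Compile-time check that the spoke implements both conversion directions.
var _ Convertible = (*SpokeSpec)(nil)

var errNotHub = errors.New("conversion target is not the expected hub type")

// copyInt32 deep-copies an optional value so the two versions never share a pointer.
func copyInt32(p *int32) *int32 {
	if p == nil {
		return nil
	}
	v := *p
	return &v
}

// ConvertTo copies the spoke's fields into the hub version.
func (s *SpokeSpec) ConvertTo(dst Hub) error {
	hub, ok := dst.(*HubSpec)
	if !ok {
		return errNotHub
	}
	hub.ModelURI = s.ModelURI
	hub.Replicas = copyInt32(s.Replicas)
	return nil
}

// ConvertFrom fills the spoke from the hub version.
func (s *SpokeSpec) ConvertFrom(src Hub) error {
	hub, ok := src.(*HubSpec)
	if !ok {
		return errNotHub
	}
	s.ModelURI = hub.ModelURI
	s.Replicas = copyInt32(hub.Replicas)
	return nil
}

func main() {
	replicas := int32(3)
	spoke := &SpokeSpec{ModelURI: "hf://Qwen/Qwen2.5-7B-Instruct", Replicas: &replicas}

	hub := &HubSpec{}
	if err := spoke.ConvertTo(hub); err != nil {
		fmt.Println("convert to hub:", err)
		return
	}

	roundTrip := &SpokeSpec{}
	if err := roundTrip.ConvertFrom(hub); err != nil {
		fmt.Println("convert from hub:", err)
		return
	}
	fmt.Println(roundTrip.ModelURI, *roundTrip.Replicas)
}
```

With a single hub, each additional version only needs a pair of conversions to and from the hub rather than one pair per other version.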

Usage

Write an LLMInferenceService YAML manifest to deploy an LLM. From that manifest, the controller creates the vLLM pods, an InferencePool, an HTTPRoute, and the scheduler resources.

Code Reference

Source Location

Repository: kserve
File: pkg/apis/serving/v1alpha1/llm_inference_service_types.go, Lines 44-107
File: pkg/apis/serving/v1alpha2/llm_inference_service_types.go, Lines 43-97 (hub version)

Signature

// LLMInferenceService serves LLMs with GPU acceleration
type LLMInferenceService struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec              LLMInferenceServiceSpec   `json:"spec,omitempty"`
    Status            LLMInferenceServiceStatus `json:"status,omitempty"`
}

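// LLMInferenceServiceSpec defines the desired LLM serving state
type LLMInferenceServiceSpec struct {
    // Model identifies the model to serve
    Model LLMModelSpec `json:"model"`
    // WorkloadSpec is embedded inline, so replicas and template sit directly under spec
    WorkloadSpec `json:",inline"`
    // Worker holds the multi-node worker configuration
    Worker *WorkerSpec `json:"worker,omitempty"`
    // Router holds the scheduler, route, and gateway configuration
    Router *RouterSpec `json:"router,omitempty"`
    // Prefill configures the disaggregated prefill pool
    Prefill *WorkloadSpec `json:"prefill,omitempty"`
}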

// LLMModelSpec defines the model to serve
type LLMModelSpec struct {
    URI  string `json:"uri"`
    Name string `json:"name,omitempty"`
}

// WorkloadSpec defines replicas and pod template
type WorkloadSpec struct {
    Replicas *int32                 `json:"replicas,omitempty"`
    Template corev1.PodTemplateSpec `json:"template,omitempty"`
}
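
To illustrate how the embedded WorkloadSpec shapes the serialized spec, here is a small sketch using only the standard library. The types are reduced local mirrors of the ones above, not imports from KServe: the pod template, worker, and router fields are left out so the example needs no Kubernetes packages, and `marshalSpec` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LLMModelSpec mirrors the model fields shown above.
type LLMModelSpec struct {
	URI  string `json:"uri"`
	Name string `json:"name,omitempty"`
}

// WorkloadSpec is a reduced mirror: the pod template is omitted on purpose.
type WorkloadSpec struct {
	Replicas *int32 `json:"replicas,omitempty"`
}

// LLMInferenceServiceSpec is a reduced mirror without Worker and Router.
type LLMInferenceServiceSpec struct {
	Model        LLMModelSpec `json:"model"`
	WorkloadSpec `json:",inline"`
	Prefill      *WorkloadSpec `json:"prefill,omitempty"`
}

// marshalSpec (hypothetical helper) renders a spec as compact JSON.
func marshalSpec(spec LLMInferenceServiceSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	decode, prefill := int32(3), int32(1)
	spec := LLMInferenceServiceSpec{
		Model:        LLMModelSpec{URI: "hf://Qwen/Qwen2.5-7B-Instruct"},
		WorkloadSpec: WorkloadSpec{Replicas: &decode},
		Prefill:      &WorkloadSpec{Replicas: &prefill},
	}

	out, err := marshalSpec(spec)
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(out)
}
```

Because the embedded struct carries no JSON name, encoding/json promotes its fields, so `replicas` appears directly under the spec while the prefill pool keeps its own nested `replicas`. This matches the manifest layout, where `spec.replicas` and `spec.prefill.replicas` are separate settings.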

Import

import servingv1alpha1 "github.com/kserve/kserve/pkg/apis/serving/v1alpha1"

I/O Contract

Inputs

Name	Type	Required	Description
spec.model.uri	string	Yes	Model URI: hf://org/model or pvc://pvc-name
spec.model.name	string	No	Model display name (defaults from URI)
spec.replicas	*int32	No	Number of decode pool replicas
spec.template	PodTemplateSpec	No	Pod spec with GPU resources and vLLM container; tagged omitempty in the Go type, set explicitly in both examples below
spec.worker	*WorkerSpec	No	Multi-node worker configuration
spec.router	*RouterSpec	No	Scheduler, route, gateway config
spec.prefill	*WorkloadSpec	No	Disaggregated prefill pool config
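
As a sketch of how a client might check spec.model.uri before submitting a manifest, the function below accepts only the two URI forms listed in the table. It is an assumption-laden illustration: `validateModelURI` is a hypothetical helper, not part of KServe, and the controller may accept other schemes that this page does not list.

```go
package main

import (
	"fmt"
	"strings"
)

// validateModelURI (hypothetical helper) accepts only the two forms listed in
// the inputs table: hf://org/model and pvc://pvc-name.
func validateModelURI(raw string) error {
	scheme, rest, ok := strings.Cut(raw, "://")
	if !ok {
		return fmt.Errorf("model URI %q has no scheme", raw)
	}
	switch scheme {
	case "hf", "pvc":
		// Schemes named in this page; others are rejected by this sketch only.
	default:
		return fmt.Errorf("unsupported model URI scheme %q", scheme)
	}
	if rest == "" {
		return fmt.Errorf("model URI %q names no model or PVC", raw)
	}
	return nil
}

func main() {
	for _, uri := range []string{
		"hf://Qwen/Qwen2.5-7B-Instruct",
		"pvc://pvc-name",
		"s3://bucket/model",
	} {
		if err := validateModelURI(uri); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", uri)
	}
}
```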

Outputs

Name	Type	Description
vLLM Pods	Pods	Running vLLM model serving containers
InferencePool	inference.networking.x-k8s.io	Endpoint pool for scheduler
HTTPRoute	gateway.networking.k8s.io	Route to the model endpoint
Scheduler	Deployment	Endpoint picker for intelligent routing
status.conditions	[]Condition	Readiness conditions, including LLMInferenceServiceReady
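
A caller waiting on a deployment would typically inspect status.conditions. The sketch below shows one way to do that under stated assumptions: `Condition` is a reduced local stand-in with only a type and a status, `isReady` is a hypothetical helper, and the condition name is taken from the table above rather than verified against the controller.

```go
package main

import "fmt"

// Condition is a reduced local stand-in for a Kubernetes-style status condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// readyConditionType is the condition name listed in the outputs table.
const readyConditionType = "LLMInferenceServiceReady"

// isReady (hypothetical helper) reports whether the ready condition is present
// and set to "True". A missing condition counts as not ready.
func isReady(conditions []Condition) bool {
	for _, c := range conditions {
		if c.Type == readyConditionType {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	pending := []Condition{{Type: readyConditionType, Status: "False"}}
	ready := []Condition{{Type: readyConditionType, Status: "True"}}

	fmt.Println(isReady(pending), isReady(ready), isReady(nil))
}
```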

Usage Examples

Single-Node GPU Deployment

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen2-7b
spec:
  model:
    uri: "hf://Qwen/Qwen2.5-7B-Instruct"
  replicas: 3
  template:
    spec:
      containers:
        - name: main
          resources:
            limits:
              cpu: "4"
              memory: 32Gi
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTPS
            initialDelaySeconds: 120
            periodSeconds: 30
  router:
    scheduler: {}
    route: {}
    gateway: {}

CPU-Only Deployment

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: opt-125m-cpu
spec:
  model:
    uri: "hf://facebook/opt-125m"
  replicas: 1
  template:
    spec:
      containers:
        - name: main
          resources:
            limits:
              cpu: "4"
              memory: 8Gi

Related Pages

Implements Principle

Principle:Kserve_Kserve_LLMInferenceService_Specification
