Implementation:Kserve Kserve LLMInferenceService CRD Spec

From Leeroopedia
Knowledge Sources
Domains: LLM_Serving, Kubernetes, GPU_Computing
Last Updated: 2026-02-13 00:00 GMT

Overview

Go type definitions for the LLMInferenceService CRD, which KServe uses to serve LLMs with vLLM, typically on GPUs.

Description

The LLMInferenceService CRD is defined in pkg/apis/serving/v1alpha1/llm_inference_service_types.go. It is served in two API versions, v1alpha1 and v1alpha2, with v1alpha2 acting as the conversion hub and conversion webhooks translating between them. The spec contains the model URI, the workload configuration (replicas and pod template), optional worker and prefill pools, and the router configuration.

Usage

To deploy an LLM, write an LLMInferenceService YAML manifest. From that manifest the controller creates the vLLM pods, an InferencePool, an HTTPRoute, and the scheduler resources.

Code Reference

Source Location

  • Repository: kserve
  • File: pkg/apis/serving/v1alpha1/llm_inference_service_types.go, Lines 44-107
  • File: pkg/apis/serving/v1alpha2/llm_inference_service_types.go, Lines 43-97 (hub version)

Signature

// LLMInferenceService serves LLMs with GPU acceleration
type LLMInferenceService struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec              LLMInferenceServiceSpec   `json:"spec,omitempty"`
    Status            LLMInferenceServiceStatus `json:"status,omitempty"`
}

// LLMInferenceServiceSpec defines the desired LLM serving state
type LLMInferenceServiceSpec struct {
    Model        LLMModelSpec  `json:"model"`
    WorkloadSpec `json:",inline"`
    Worker       *WorkerSpec   `json:"worker,omitempty"`
    Router       *RouterSpec   `json:"router,omitempty"`
    Prefill      *WorkloadSpec `json:"prefill,omitempty"`
}

// LLMModelSpec defines the model to serve
type LLMModelSpec struct {
    URI  string `json:"uri"`
    Name string `json:"name,omitempty"`
}

// WorkloadSpec defines replicas and pod template
type WorkloadSpec struct {
    Replicas *int32                 `json:"replicas,omitempty"`
    Template corev1.PodTemplateSpec `json:"template,omitempty"`
}
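
Because WorkloadSpec is embedded with `json:",inline"`, its fields appear directly under spec, whereas the prefill pool carries its own nested copy under spec.prefill. The sketch below illustrates that layout with the standard encoding/json package, which flattens embedded structs and ignores the `inline` option. The types Spec, ModelSpec, and the helper topLevelKeys are reduced, hypothetical stand-ins (the pod template is omitted), not the KServe types themselves.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ModelSpec is a hypothetical stand-in for LLMModelSpec.
type ModelSpec struct {
	URI  string `json:"uri"`
	Name string `json:"name,omitempty"`
}

// WorkloadSpec is a reduced stand-in: the pod template is left out so the
// sketch compiles with only the standard library.
type WorkloadSpec struct {
	Replicas *int32 `json:"replicas,omitempty"`
}

// Spec embeds WorkloadSpec the same way the excerpt does, so "replicas"
// is serialized at the top level rather than under a "WorkloadSpec" key.
type Spec struct {
	Model        ModelSpec `json:"model"`
	WorkloadSpec `json:",inline"`
	Prefill      *WorkloadSpec `json:"prefill,omitempty"`
}

func int32Ptr(v int32) *int32 { return &v }

// topLevelKeys marshals a Spec and reports which keys appear at the top level.
func topLevelKeys(s Spec) (map[string]bool, error) {
	raw, err := json.Marshal(s)
	if err != nil {
		return nil, err
	}
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(raw, &fields); err != nil {
		return nil, err
	}
	keys := make(map[string]bool, len(fields))
	for k := range fields {
		keys[k] = true
	}
	return keys, nil
}

func main() {
	spec := Spec{
		Model:        ModelSpec{URI: "hf://Qwen/Qwen2.5-7B-Instruct"},
		WorkloadSpec: WorkloadSpec{Replicas: int32Ptr(3)},
		Prefill:      &WorkloadSpec{Replicas: int32Ptr(1)},
	}
	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

This is why a manifest sets spec.replicas and spec.template directly, while a prefill pool repeats the same fields one level down.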

Import

import servingv1alpha1 "github.com/kserve/kserve/pkg/apis/serving/v1alpha1"

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
spec.model.uri | string | Yes | Model URI: hf://org/model or pvc://pvc-name
spec.model.name | string | No | Model display name (defaults from URI)
spec.replicas | *int32 | No | Number of decode pool replicas
spec.template | PodTemplateSpec | Yes | Pod spec with GPU resources and vLLM container
spec.worker | *WorkerSpec | No | Multi-node worker configuration
spec.router | *RouterSpec | No | Scheduler, route, gateway config
spec.prefill | *WorkloadSpec | No | Disaggregated prefill pool config
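
The Inputs table names two URI forms for spec.model.uri. A client-side pre-check for them might look like the following sketch. The function splitModelURI and the knownSchemes set are hypothetical helpers written for illustration; they list only the two schemes shown here, and the controller's own validation may accept others or apply different rules.

```go
package main

import (
	"fmt"
	"strings"
)

// knownSchemes holds only the schemes named in the Inputs table.
// Assumption: the real controller may support additional schemes.
var knownSchemes = map[string]bool{"hf": true, "pvc": true}

// splitModelURI splits "scheme://location" and rejects URIs that are
// malformed or use a scheme outside knownSchemes.
func splitModelURI(uri string) (string, string, error) {
	scheme, location, ok := strings.Cut(uri, "://")
	if !ok || location == "" {
		return "", "", fmt.Errorf("model URI %q is not of the form scheme://location", uri)
	}
	if !knownSchemes[scheme] {
		return "", "", fmt.Errorf("model URI %q uses scheme %q, which this sketch does not list", uri, scheme)
	}
	return scheme, location, nil
}

func main() {
	for _, uri := range []string{
		"hf://Qwen/Qwen2.5-7B-Instruct",
		"pvc://pvc-name",
		"Qwen/Qwen2.5-7B-Instruct", // missing scheme
	} {
		scheme, location, err := splitModelURI(uri)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("scheme=%s location=%s\n", scheme, location)
	}
}
```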

Outputs

Name | Type | Description
--- | --- | ---
vLLM Pods | Pods | Running vLLM model serving containers
InferencePool | inference.networking.x-k8s.io | Endpoint pool for scheduler
HTTPRoute | gateway.networking.k8s.io | Route to the model endpoint
Scheduler | Deployment | Endpoint picker for intelligent routing
status.conditions | []Condition | LLMInferenceServiceReady
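
A client that waits for the service to come up would typically scan status.conditions for a readiness entry. The sketch below shows that scan over a reduced, hypothetical Condition type. The condition type string is passed in as a parameter because the exact name used by the controller is not shown on this page; "Ready" in the example is an assumption.

```go
package main

import "fmt"

// Condition is a hypothetical, reduced stand-in for one entry of
// status.conditions; the real type carries more fields.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// isConditionTrue reports whether a condition of the given type is
// present and has status "True". A missing condition counts as not ready.
func isConditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	// Assumption: "Ready" is used here purely as an illustrative type name.
	conds := []Condition{
		{Type: "Ready", Status: "True"},
	}
	fmt.Println("ready:", isConditionTrue(conds, "Ready"))
}
```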

Usage Examples

Single-Node GPU Deployment

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen2-7b
spec:
  model:
    uri: "hf://Qwen/Qwen2.5-7B-Instruct"
  replicas: 3
  template:
    spec:
      containers:
        - name: main
          resources:
            limits:
              cpu: "4"
              memory: 32Gi
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTPS
            initialDelaySeconds: 120
            periodSeconds: 30
  router:
    scheduler: {}
    route: {}
    gateway: {}

CPU-Only Deployment

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: opt-125m-cpu
spec:
  model:
    uri: "hf://facebook/opt-125m"
  replicas: 1
  template:
    spec:
      containers:
        - name: main
          resources:
            limits:
              cpu: "4"
              memory: 8Gi

Related Pages

Implements Principle
