Principle:Kserve Kserve LLMInferenceService Specification

Knowledge Sources	KServe LLM Serving vLLM
Domains	LLM_Serving, Kubernetes, GPU_Computing
Last Updated	2026-02-13 00:00 GMT

Overview

A purpose-built CRD specification for deploying large language models as OpenAI-compatible inference endpoints with GPU scheduling, worker management, and intelligent request routing.

Description

The LLMInferenceService Specification extends KServe's serving capabilities specifically for LLMs. Unlike the general InferenceService, it provides:

Model spec: Direct hf:// or pvc:// URI for model artifacts.
Workload spec: Replicas, GPU resource requests, pod templates for vLLM containers.
Worker spec: Optional multi-node worker pods for tensor/data/expert parallelism.
Router spec: Scheduler, route, and gateway configuration for intelligent request routing.
Prefill spec: Optional disaggregated prefill pool for KV cache separation.

Usage

Use this instead of InferenceService when deploying LLMs that need:

GPU scheduling
OpenAI-compatible API endpoints
Intelligent request routing (prefix cache, load-aware)
Multi-node distributed inference
Disaggregated prefill-decode serving

Theoretical Basis

# LLMInferenceService spec model (NOT implementation code)
LLMInferenceService:
  spec:
    model:
      uri: "hf://Qwen/Qwen2.5-7B-Instruct"
      name: "Qwen2.5-7B"
    replicas: 3            # Decode pool replicas
    template:              # Pod template with GPU resources
      containers:
        - resources:
            limits:
              nvidia.com/gpu: "1"
    worker:                # Optional: multi-node workers
      replicas: 4
    router:                # Request routing
      scheduler: {}        # Endpoint picker (prefix cache, load-aware)
      route: {}            # HTTPRoute configuration
      gateway: {}          # Gateway binding
    prefill:               # Optional: disaggregated PD
      replicas: 2

Related Pages

Implemented By

Implementation:Kserve_Kserve_LLMInferenceService_CRD_Spec

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment