Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Kserve Kserve LLMInferenceService Specification

From Leeroopedia
Knowledge Sources
Domains LLM_Serving, Kubernetes, GPU_Computing
Last Updated 2026-02-13 00:00 GMT

Overview

A purpose-built CRD specification for deploying large language models as OpenAI-compatible inference endpoints with GPU scheduling, worker management, and intelligent request routing.

Description

The LLMInferenceService Specification extends KServe's serving capabilities specifically for LLMs. Unlike the general InferenceService, it provides:

  • Model spec: Direct hf:// or pvc:// URI for model artifacts.
  • Workload spec: Replicas, GPU resource requests, pod templates for vLLM containers.
  • Worker spec: Optional multi-node worker pods for tensor/data/expert parallelism.
  • Router spec: Scheduler, route, and gateway configuration for intelligent request routing.
  • Prefill spec: Optional disaggregated prefill pool for KV cache separation.

Usage

Use this instead of InferenceService when deploying LLMs that need:

  • GPU scheduling
  • OpenAI-compatible API endpoints
  • Intelligent request routing (prefix cache, load-aware)
  • Multi-node distributed inference
  • Disaggregated prefill-decode serving

Theoretical Basis

# LLMInferenceService spec model (NOT implementation code)
LLMInferenceService:
  spec:
    model:
      uri: "hf://Qwen/Qwen2.5-7B-Instruct"
      name: "Qwen2.5-7B"
    replicas: 3            # Decode pool replicas
    template:              # Pod template with GPU resources
      containers:
        - resources:
            limits:
              nvidia.com/gpu: "1"
    worker:                # Optional: multi-node workers
      replicas: 4
    router:                # Request routing
      scheduler: {}        # Endpoint picker (prefix cache, load-aware)
      route: {}            # HTTPRoute configuration
      gateway: {}          # Gateway binding
    prefill:               # Optional: disaggregated PD
      replicas: 2

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment