Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Kserve Kserve LLM Inference Serving

From Leeroopedia
Knowledge Sources
Domains LLM_Serving, Kubernetes, GPU_Inference, Generative_AI
Last Updated 2026-02-13 14:00 GMT

Overview

End-to-end process for deploying a large language model as a scalable OpenAI-compatible inference endpoint using KServe LLMInferenceService on GPU-equipped Kubernetes clusters.

Description

This workflow covers deploying LLMs (e.g., Qwen2.5-7B, OPT-125M) using the LLMInferenceService custom resource, which is optimized for generative AI workloads. It leverages vLLM as the model server and provides an OpenAI-compatible API. The deployment supports multiple replicas with optional scheduler-based load balancing, prefix cache-aware routing, and autoscaling. Models can be loaded from HuggingFace Hub or pre-downloaded to PersistentVolumes. The LLMInferenceService controller manages worker pod creation, scheduler deployment, and HTTPRoute configuration.

Usage

Execute this workflow when you need to deploy a generative language model for text completion or chat inference on a Kubernetes cluster with GPU nodes. This is the recommended approach for single-node GPU deployments of models that fit in one GPU (e.g., 7B parameter models on a single GPU with sufficient VRAM). Use this workflow for development, testing, or production serving of LLMs with OpenAI-compatible API endpoints.

Execution Steps

Step 1: Prepare GPU cluster and model access

Ensure the Kubernetes cluster has GPU nodes with NVIDIA drivers and device plugins installed. Configure model access by either setting up a HuggingFace ServiceAccount with a token secret for private model download, or pre-downloading model weights to a PersistentVolumeClaim using a download job.

Key considerations:

  • GPU nodes must have the NVIDIA device plugin installed
  • For HuggingFace models: create a Secret with the HF token and attach to a ServiceAccount
  • For PVC-based models: create a PVC and run a download Job to fetch the model weights
  • Verify GPU availability with kubectl describe node

Step 2: Install the LLMInferenceService subsystem

Deploy the LLMInferenceService controller, CRDs, webhooks, and ConfigMaps using the provided Kustomize overlays. This includes the LLM scheduler, router templates, and worker pod templates. The subsystem operates alongside the main KServe controller manager.

Key considerations:

  • Use the standalone or addons Kustomize overlay depending on cluster setup
  • The LLMInferenceService CRD and LLMInferenceServiceConfig CRD must be installed
  • ConfigMaps define pod templates for workers, scheduler, and router components
  • Cert-manager certificates are required for webhook TLS

Step 3: Write the LLMInferenceService specification

Author the LLMInferenceService YAML manifest specifying the model name, number of replicas, GPU count per replica, and optional scheduler configuration. Choose between deployment modes: with default scheduler (prefix cache routing and load balancing), without scheduler (direct Kubernetes service routing), or with prefill-decode separation.

Key considerations:

  • Specify the model name matching the HuggingFace model ID
  • Set workerSpec.replicas for the desired number of GPU workers
  • Configure tensorParallelSize if the model requires multi-GPU tensor parallelism
  • Choose scheduler mode based on routing requirements

Step 4: Apply the LLMInferenceService to Kubernetes

Submit the LLMInferenceService manifest to the cluster. The LLMInferenceService controller reconciles the resource by creating worker StatefulSets, an optional scheduler Deployment, and an HTTPRoute for ingress. The workers download the model and initialize the vLLM engine.

What happens internally:

  • Controller creates worker pods from the configured pod template
  • Each worker downloads the model and starts the vLLM serving engine
  • If scheduler is enabled, a scheduler pod is deployed for intelligent routing
  • HTTPRoute is created to expose the inference endpoint
  • Gateway Inference Extension CRDs may be used for advanced routing

Step 5: Wait for model loading and readiness

Monitor the LLMInferenceService status and worker pod logs for model loading progress. Large models may require significant time for download and GPU memory allocation. The service transitions to Ready when all worker replicas pass health checks and the vLLM engine is initialized.

Key considerations:

  • Initial delay can be substantial (initialDelaySeconds may be set to 4800 for large models)
  • Monitor worker pod logs for download and loading progress
  • Check LLMInferenceService status with kubectl get llminferenceservice
  • Verify all worker pods are in Running state

Step 6: Send inference requests

Send completion or chat requests to the LLMInferenceService endpoint using the OpenAI-compatible API. The endpoint supports /v1/completions for text completion and /v1/chat/completions for chat-style inference. Requests are routed through the scheduler (if enabled) or directly to workers via Kubernetes service load balancing.

Key considerations:

  • Completion endpoint: /v1/completions
  • Chat endpoint: /v1/chat/completions
  • Specify the model name matching the deployed model ID
  • Set max_tokens to control generation length
  • The API is compatible with OpenAI client libraries

Execution Diagram

GitHub URL

Workflow Repository