Workflow:Kserve Kserve LLM Inference Serving
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Kubernetes, GPU_Inference, Generative_AI |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
End-to-end process for deploying a large language model as a scalable OpenAI-compatible inference endpoint using KServe LLMInferenceService on GPU-equipped Kubernetes clusters.
Description
This workflow covers deploying LLMs (e.g., Qwen2.5-7B, OPT-125M) using the LLMInferenceService custom resource, which is optimized for generative AI workloads. It leverages vLLM as the model server and provides an OpenAI-compatible API. The deployment supports multiple replicas with optional scheduler-based load balancing, prefix cache-aware routing, and autoscaling. Models can be loaded from HuggingFace Hub or pre-downloaded to PersistentVolumes. The LLMInferenceService controller manages worker pod creation, scheduler deployment, and HTTPRoute configuration.
Usage
Execute this workflow when you need to deploy a generative language model for text completion or chat inference on a Kubernetes cluster with GPU nodes. This is the recommended approach for single-node GPU deployments of models that fit in one GPU (e.g., 7B parameter models on a single GPU with sufficient VRAM). Use this workflow for development, testing, or production serving of LLMs with OpenAI-compatible API endpoints.
Execution Steps
Step 1: Prepare GPU cluster and model access
Ensure the Kubernetes cluster has GPU nodes with NVIDIA drivers and device plugins installed. Configure model access by either setting up a HuggingFace ServiceAccount with a token secret for private model download, or pre-downloading model weights to a PersistentVolumeClaim using a download job.
Key considerations:
- GPU nodes must have the NVIDIA device plugin installed
- For HuggingFace models: create a Secret with the HF token and attach to a ServiceAccount
- For PVC-based models: create a PVC and run a download Job to fetch the model weights
- Verify GPU availability with kubectl describe node
Step 2: Install the LLMInferenceService subsystem
Deploy the LLMInferenceService controller, CRDs, webhooks, and ConfigMaps using the provided Kustomize overlays. This includes the LLM scheduler, router templates, and worker pod templates. The subsystem operates alongside the main KServe controller manager.
Key considerations:
- Use the standalone or addons Kustomize overlay depending on cluster setup
- The LLMInferenceService CRD and LLMInferenceServiceConfig CRD must be installed
- ConfigMaps define pod templates for workers, scheduler, and router components
- Cert-manager certificates are required for webhook TLS
Step 3: Write the LLMInferenceService specification
Author the LLMInferenceService YAML manifest specifying the model name, number of replicas, GPU count per replica, and optional scheduler configuration. Choose between deployment modes: with default scheduler (prefix cache routing and load balancing), without scheduler (direct Kubernetes service routing), or with prefill-decode separation.
Key considerations:
- Specify the model name matching the HuggingFace model ID
- Set workerSpec.replicas for the desired number of GPU workers
- Configure tensorParallelSize if the model requires multi-GPU tensor parallelism
- Choose scheduler mode based on routing requirements
Step 4: Apply the LLMInferenceService to Kubernetes
Submit the LLMInferenceService manifest to the cluster. The LLMInferenceService controller reconciles the resource by creating worker StatefulSets, an optional scheduler Deployment, and an HTTPRoute for ingress. The workers download the model and initialize the vLLM engine.
What happens internally:
- Controller creates worker pods from the configured pod template
- Each worker downloads the model and starts the vLLM serving engine
- If scheduler is enabled, a scheduler pod is deployed for intelligent routing
- HTTPRoute is created to expose the inference endpoint
- Gateway Inference Extension CRDs may be used for advanced routing
Step 5: Wait for model loading and readiness
Monitor the LLMInferenceService status and worker pod logs for model loading progress. Large models may require significant time for download and GPU memory allocation. The service transitions to Ready when all worker replicas pass health checks and the vLLM engine is initialized.
Key considerations:
- Initial delay can be substantial (initialDelaySeconds may be set to 4800 for large models)
- Monitor worker pod logs for download and loading progress
- Check LLMInferenceService status with kubectl get llminferenceservice
- Verify all worker pods are in Running state
Step 6: Send inference requests
Send completion or chat requests to the LLMInferenceService endpoint using the OpenAI-compatible API. The endpoint supports /v1/completions for text completion and /v1/chat/completions for chat-style inference. Requests are routed through the scheduler (if enabled) or directly to workers via Kubernetes service load balancing.
Key considerations:
- Completion endpoint: /v1/completions
- Chat endpoint: /v1/chat/completions
- Specify the model name matching the deployed model ID
- Set max_tokens to control generation length
- The API is compatible with OpenAI client libraries