Principle: KServe OpenAI-Compatible Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, API_Design, Inference |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
An API compatibility pattern that serves LLM inference through OpenAI-compatible REST endpoints, enabling drop-in replacement of OpenAI API calls.
Description
OpenAI-Compatible Inference provides standard REST endpoints that match the OpenAI API specification:
- `/v1/completions` — Text completion
- `/v1/chat/completions` — Chat-based completion with message history
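To make the two request shapes concrete, here is a minimal sketch of the request bodies each endpoint accepts. The model name `my-llm` is a placeholder, not a name from the source:

```python
import json

# Hypothetical model name; in practice it must match the model name
# declared in the LLMInferenceService spec.
MODEL = "my-llm"

# /v1/completions body: a raw prompt in, generated text out.
completion_request = {
    "model": MODEL,
    "prompt": "KServe is",
    "max_tokens": 32,
}

# /v1/chat/completions body: a list of role-tagged messages,
# carrying the conversation history.
chat_request = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "What is KServe?"},
    ],
}

print(json.dumps(chat_request))
```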
vLLM implements these endpoints natively, and KServe routes requests to vLLM pods through Envoy Gateway HTTPRoutes. This enables applications built for the OpenAI API to use self-hosted models without code changes.
Usage
Send standard OpenAI API requests to the LLMInferenceService endpoint. The model name in the request body should match the model name in the LLMInferenceService spec.
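A minimal sketch of such a request using only the standard library; the gateway host, namespace, and service name are placeholder values, and the request is constructed but not sent:

```python
import json
import urllib.request

# Hypothetical endpoint: substitute your gateway host plus the
# /<namespace>/<name> prefix of your LLMInferenceService.
BASE_URL = "http://gateway.example.com/demo-ns/my-llm"

body = json.dumps({
    "model": "my-llm",  # must match the model name in the LLMInferenceService spec
    "messages": [{"role": "user", "content": "Hello"}],
}).encode("utf-8")

req = urllib.request.Request(
    BASE_URL + "/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# urllib.request.urlopen(req) would send it; the gateway strips the
# /<namespace>/<name> prefix before the request reaches the vLLM pod.
print(req.full_url)
```

Because the path and payload follow the OpenAI specification, an existing OpenAI client can be pointed at `BASE_URL` instead of the hosted API.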
Theoretical Basis
```
# OpenAI API compatibility (NOT implementation code)
Endpoints:
  POST /v1/completions
    Input:  {"model": "<name>", "prompt": "...", "max_tokens": N}
    Output: {"choices": [{"text": "...", "finish_reason": "stop"}], "usage": {...}}
  POST /v1/chat/completions
    Input:  {"model": "<name>", "messages": [{"role": "user", "content": "..."}]}
    Output: {"choices": [{"message": {"role": "assistant", "content": "..."}}]}

Routing path:
  Client → Envoy Gateway → HTTPRoute → InferencePool → vLLM pod
  URL rewrite: /<namespace>/<name>/v1/completions → /v1/completions
```
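The prefix rewrite can be sketched as a pure function. This is illustrative only; in the actual deployment Envoy Gateway performs the rewrite, and the namespace and name values are placeholders:

```python
def rewrite_path(path: str, namespace: str, name: str) -> str:
    """Strip the /<namespace>/<name> prefix so the vLLM pod
    receives a bare OpenAI-style path."""
    prefix = f"/{namespace}/{name}"
    if not path.startswith(prefix + "/"):
        raise ValueError(f"path {path!r} does not start with {prefix!r}")
    return path[len(prefix):]

print(rewrite_path("/demo-ns/my-llm/v1/completions", "demo-ns", "my-llm"))
# → /v1/completions
```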