Implementation: KServe OpenAI Completions API
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, API_Design, Inference |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete curl-based patterns for sending OpenAI-compatible inference requests to KServe LLMInferenceService endpoints via Envoy Gateway.
Description
vLLM's built-in OpenAI-compatible server exposes the /v1/completions and /v1/chat/completions endpoints. KServe routes external requests through an Envoy Gateway HTTPRoute that performs URL rewriting: the external route URL follows the pattern /<namespace>/<name>/v1/completions, which the HTTPRoute rewrites to /v1/completions before the request reaches the backend.
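The prefix rewrite described above can be sketched with a standard Gateway API URLRewrite filter. This is an illustrative sketch, not the actual contents of config-llm-router-route.yaml; the route name, namespace, and service name below are placeholders.

```yaml
# Illustrative HTTPRoute sketch (Gateway API): strips the
# /<namespace>/<name> prefix so vLLM sees the bare OpenAI path.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen2-7b-route              # illustrative name
spec:
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /default/qwen2-7b    # /<namespace>/<name>
      filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplacePrefixMatch
              replacePrefixMatch: /     # /default/qwen2-7b/v1/completions -> /v1/completions
```

With ReplacePrefixMatch, a request to /default/qwen2-7b/v1/completions is forwarded upstream as /v1/completions.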
Usage
Send an HTTP POST request with a JSON body containing the model name, a prompt (or messages array), and optional sampling parameters. The model name must match the one configured in the LLMInferenceService spec.
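The same request can be built from Python with only the standard library. This is a minimal sketch: the route URL and model name are placeholders to be filled in from your deployment, and the final request is only constructed, not sent.

```python
import json
import urllib.request

# Placeholders: substitute the route URL and the model name from
# your LLMInferenceService spec.
ROUTE_URL = "https://<route-url>"

payload = {
    "model": "<model-name>",          # must match the LLMInferenceService spec
    "prompt": "What is Kubernetes?",
    "max_tokens": 100,
    "temperature": 0.7,
}

req = urllib.request.Request(
    f"{ROUTE_URL}/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment against a live endpoint
```

For chat completions, replace "prompt" with a "messages" array and target /v1/chat/completions instead.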
Code Reference
Source Location
- Repository: kserve
- File: docs/samples/llmisvc/single-node-gpu/README.md, Lines 214-226
- File: config/llmisvcconfig/config-llm-router-route.yaml (HTTPRoute with URL rewrite)
Signature
```bash
# Text completion
curl -k https://<route-url>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "...", "max_tokens": N}'

# Chat completion
curl -k https://<route-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "..."}]}'
```
Import
```bash
# Get route URL
kubectl get route -l serving.kserve.io/inferenceservice=<service-name>
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (must match LLMInferenceService spec) |
| prompt | string | Yes (completions) | Text prompt for completion |
| messages | array | Yes (chat) | Chat message history |
| max_tokens | int | No | Maximum tokens to generate |
| temperature | float | No | Sampling temperature (0.0-2.0) |
Outputs
| Name | Type | Description |
|---|---|---|
| choices | array | Generated completions with text/message and finish_reason |
| usage | object | Token counts: prompt_tokens, completion_tokens, total_tokens |
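The two output fields above can be read straight off the decoded JSON body. The dict below is illustrative sample data shaped like the Outputs table, not captured output from a live endpoint:

```python
# Illustrative response body, shaped like the Outputs table above.
response = {
    "choices": [
        {"text": "Kubernetes is an open-source container orchestration platform...",
         "finish_reason": "length"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 100, "total_tokens": 105},
}

text = response["choices"][0]["text"]
# "length" means max_tokens was reached; "stop" means the model ended naturally.
reason = response["choices"][0]["finish_reason"]
total = response["usage"]["total_tokens"]
```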
Usage Examples
Text Completion
```bash
ROUTE_URL=$(kubectl get route -l serving.kserve.io/inferenceservice=qwen2-7b \
  -o jsonpath='{.items[0].status.url}')

curl -k ${ROUTE_URL}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "What is Kubernetes?",
    "max_tokens": 100,
    "temperature": 0.7
  }'

# Response:
# {
#   "choices": [{
#     "text": "Kubernetes is an open-source container orchestration platform...",
#     "finish_reason": "length"
#   }],
#   "usage": {"prompt_tokens": 5, "completion_tokens": 100, "total_tokens": 105}
# }
```
Chat Completion
```bash
curl -k ${ROUTE_URL}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain KServe in one sentence."}
    ],
    "max_tokens": 50
  }'
```