Implementation: KServe OpenAI Completions API
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, API_Design, Inference |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete curl-based patterns for sending OpenAI-compatible inference requests to KServe LLMInferenceService endpoints via Envoy Gateway.
Description
vLLM's built-in OpenAI-compatible server exposes the /v1/completions and /v1/chat/completions endpoints. KServe routes external requests through an Envoy Gateway HTTPRoute that performs URL rewriting: the external route URL follows the pattern /<namespace>/<name>/v1/completions, which the HTTPRoute rewrites to /v1/completions before the request reaches the backend.
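The prefix rewrite described above can be sketched with a standard Gateway API URLRewrite filter. This is an illustrative sketch, not the actual contents of config-llm-router-route.yaml; the route name, namespace, and service name below are placeholders.

```yaml
# Illustrative HTTPRoute sketch (Gateway API): strips the
# /<namespace>/<name> prefix so vLLM sees the bare OpenAI path.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen2-7b-route              # illustrative name
spec:
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /default/qwen2-7b    # /<namespace>/<name>
      filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplacePrefixMatch
              replacePrefixMatch: /     # /default/qwen2-7b/v1/completions -> /v1/completions
```

With ReplacePrefixMatch, a request to /default/qwen2-7b/v1/completions is forwarded upstream as /v1/completions.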
Usage
Send an HTTP POST request with a JSON body containing the model name, a prompt (or messages array), and optional sampling parameters. The model name must match the one configured in the LLMInferenceService spec.
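The same request can be built from Python with only the standard library. This is a minimal sketch: the route URL and model name are placeholders to be filled in from your deployment, and the final request is only constructed, not sent.

```python
import json
import urllib.request

# Placeholders: substitute the route URL and the model name from
# your LLMInferenceService spec.
ROUTE_URL = "https://<route-url>"

payload = {
    "model": "<model-name>",          # must match the LLMInferenceService spec
    "prompt": "What is Kubernetes?",
    "max_tokens": 100,
    "temperature": 0.7,
}

req = urllib.request.Request(
    f"{ROUTE_URL}/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment against a live endpoint
```

For chat completions, replace "prompt" with a "messages" array and target /v1/chat/completions instead.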
Code Reference
Source Location
- Repository: kserve
- File: docs/samples/llmisvc/single-node-gpu/README.md, Lines 214-226
- File: config/llmisvcconfig/config-llm-router-route.yaml (HTTPRoute with URL rewrite)
Signature
```bash
# Text completion
curl -k https://<route-url>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "...", "max_tokens": N}'

# Chat completion
curl -k https://<route-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "..."}]}'
```
Import
```bash
# Get route URL
kubectl get route -l serving.kserve.io/inferenceservice=<service-name>
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (must match LLMInferenceService spec) |
| prompt | string | Yes (completions) | Text prompt for completion |
| messages | array | Yes (chat) | Chat message history |
| max_tokens | int | No | Maximum tokens to generate |
| temperature | float | No | Sampling temperature (0.0-2.0) |
Outputs
| Name | Type | Description |
|---|---|---|
| choices | array | Generated completions with text/message and finish_reason |
| usage | object | Token counts: prompt_tokens, completion_tokens, total_tokens |
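The two output fields above can be read straight off the decoded JSON body. The dict below is illustrative sample data shaped like the Outputs table, not captured output from a live endpoint:

```python
# Illustrative response body, shaped like the Outputs table above.
response = {
    "choices": [
        {"text": "Kubernetes is an open-source container orchestration platform...",
         "finish_reason": "length"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 100, "total_tokens": 105},
}

text = response["choices"][0]["text"]
# "length" means max_tokens was reached; "stop" means the model ended naturally.
reason = response["choices"][0]["finish_reason"]
total = response["usage"]["total_tokens"]
```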
Usage Examples
Text Completion
```bash
ROUTE_URL=$(kubectl get route -l serving.kserve.io/inferenceservice=qwen2-7b \
  -o jsonpath='{.items[0].status.url}')

curl -k ${ROUTE_URL}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "What is Kubernetes?",
    "max_tokens": 100,
    "temperature": 0.7
  }'

# Response:
# {
#   "choices": [{
#     "text": "Kubernetes is an open-source container orchestration platform...",
#     "finish_reason": "length"
#   }],
#   "usage": {"prompt_tokens": 5, "completion_tokens": 100, "total_tokens": 105}
# }
```
Chat Completion
```bash
curl -k ${ROUTE_URL}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain KServe in one sentence."}
    ],
    "max_tokens": 50
  }'
```