
Implementation:Kserve OpenAI Completions API

From Leeroopedia
Knowledge Sources
Domains LLM_Serving, API_Design, Inference
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete curl-based patterns for sending OpenAI-compatible inference requests to KServe LLMInferenceService endpoints via Envoy Gateway.

Description

vLLM's built-in OpenAI-compatible server provides the /v1/completions and /v1/chat/completions endpoints. KServe routes external requests through an Envoy Gateway HTTPRoute with URL rewriting: the external path follows the pattern /<namespace>/<name>/v1/completions, which the HTTPRoute rewrites to /v1/completions before forwarding to the backend.
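The rewrite described above can be sketched as a Gateway API HTTPRoute fragment. This is a hedged illustration of the pattern, not the exact manifest shipped in config-llm-router-route.yaml; names and path values are placeholders:

```yaml
# Illustrative sketch of the URLRewrite pattern described above.
# The real route is generated by KServe; see config-llm-router-route.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-router-route            # placeholder name
spec:
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /<namespace>/<name>/v1   # external prefix
      filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplacePrefixMatch
              replacePrefixMatch: /v1       # path vLLM actually serves
```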

Usage

Send HTTP POST requests with JSON body containing model name, prompt, and parameters. The model name must match the one configured in the LLMInferenceService spec.

Code Reference

Source Location

  • Repository: kserve
  • File: docs/samples/llmisvc/single-node-gpu/README.md, Lines 214-226
  • File: config/llmisvcconfig/config-llm-router-route.yaml (HTTPRoute with URL rewrite)

Signature

# Text completion
curl -k https://<route-url>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "...", "max_tokens": N}'

# Chat completion
curl -k https://<route-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "..."}]}'

Import

# Get route URL
kubectl get route -l serving.kserve.io/inferenceservice=<service-name>

I/O Contract

Inputs

  • model (string, required): Model name; must match the LLMInferenceService spec
  • prompt (string, required for /v1/completions): Text prompt for completion
  • messages (array, required for /v1/chat/completions): Chat message history
  • max_tokens (int, optional): Maximum tokens to generate
  • temperature (float, optional): Sampling temperature (0.0-2.0)
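The input contract can be sketched as a small Python helper. This is a hypothetical convenience function, not part of KServe or vLLM; it only mirrors the field rules listed above:

```python
def build_completion_payload(model, prompt=None, messages=None,
                             max_tokens=None, temperature=None):
    """Build a JSON body for /v1/completions or /v1/chat/completions.

    Exactly one of `prompt` (text completion) or `messages` (chat) is
    required, mirroring the I/O contract above.
    """
    if (prompt is None) == (messages is None):
        raise ValueError("provide exactly one of prompt or messages")
    payload = {"model": model}
    if prompt is not None:
        payload["prompt"] = prompt
    else:
        payload["messages"] = messages
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if temperature is not None:
        # The contract above bounds temperature to [0.0, 2.0].
        if not 0.0 <= temperature <= 2.0:
            raise ValueError("temperature must be in [0.0, 2.0]")
        payload["temperature"] = temperature
    return payload
```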

Outputs

  • choices (array): Generated completions with text/message and finish_reason
  • usage (object): Token counts: prompt_tokens, completion_tokens, total_tokens
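Reading the output contract back can be sketched the same way. This assumed helper extracts the generated text and token usage from a response body, handling both the completions "text" field and the chat "message" nesting:

```python
import json

def summarize_response(body):
    """Return (text, finish_reason, total_tokens) from a response body."""
    data = json.loads(body)
    choice = data["choices"][0]
    # /v1/completions returns "text"; /v1/chat/completions nests the
    # content under "message" instead.
    if "text" in choice:
        text = choice["text"]
    else:
        text = choice["message"]["content"]
    return text, choice["finish_reason"], data["usage"]["total_tokens"]
```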

Usage Examples

Text Completion

ROUTE_URL=$(kubectl get route -l serving.kserve.io/inferenceservice=qwen2-7b \
  -o jsonpath='{.items[0].status.url}')

curl -k ${ROUTE_URL}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "What is Kubernetes?",
    "max_tokens": 100,
    "temperature": 0.7
  }'

# Response:
# {
#   "choices": [{
#     "text": "Kubernetes is an open-source container orchestration platform...",
#     "finish_reason": "length"
#   }],
#   "usage": {"prompt_tokens": 5, "completion_tokens": 100, "total_tokens": 105}
# }

Chat Completion

curl -k ${ROUTE_URL}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain KServe in one sentence."}
    ],
    "max_tokens": 50
  }'
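The same chat request can be issued from Python's standard library instead of curl. This is a sketch: the route URL is a placeholder, and the unverified-TLS context mirrors what curl -k does (skipping certificate verification), which is only appropriate for test gateways:

```python
import json
import ssl
import urllib.request

def chat_request(route_url, payload):
    """Build a POST request for /v1/chat/completions, like the curl example."""
    return urllib.request.Request(
        f"{route_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(req):
    # Equivalent of curl -k: do not verify the gateway's TLS certificate.
    ctx = ssl._create_unverified_context()
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)
```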

