Implementation:Sgl project Sglang V1 Chat Completions
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, API_Design, Chat |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool, provided by the SGLang HTTP server, for processing OpenAI-compatible chat completion requests.
Description
The /v1/chat/completions endpoint is a FastAPI route that accepts ChatCompletionRequest objects (validated via Pydantic), applies the model's chat template to format the conversation, and routes the request through the SGLang engine for generation. The response follows the OpenAI ChatCompletion schema. SGLang extends the standard with additional parameters such as regex for constrained decoding, and it honors the standard response_format parameter for JSON schema enforcement.
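As a rough illustration of the request body this endpoint validates, the sketch below assembles a JSON payload that includes the SGLang-specific regex field. The helper name build_chat_payload is hypothetical, and only fields listed on this page are used. With the OpenAI Python SDK, a non-standard field such as regex would typically be passed through extra_body rather than as a direct keyword argument.

```python
import json
from typing import Any, Dict, List, Optional


def build_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    regex: Optional[str] = None,
    **sampling: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble a /v1/chat/completions request body."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    # Keep only the sampling options that were actually set.
    payload.update({k: v for k, v in sampling.items() if v is not None})
    if regex is not None:
        # SGLang extension: constrain the generated text to this pattern.
        payload["regex"] = regex
    return payload


payload = build_chat_payload(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Is Paris in France? Answer yes or no."}],
    regex="(yes|no)",
    temperature=0,
    max_tokens=8,
    top_p=None,  # unset options are dropped from the body
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized and sent as the POST body by any HTTP client.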
Usage
Send HTTP POST requests to /v1/chat/completions on a running SGLang server. Use the OpenAI Python SDK or any HTTP client. This endpoint handles both streaming and non-streaming responses.
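When stream is enabled and a plain HTTP client is used instead of the SDK, the response arrives as server-sent events. The sketch below assumes the stream follows the OpenAI convention of "data:" lines carrying JSON chunks and a final "data: [DONE]" sentinel. The function name iter_stream_text and the sample events are illustrative, not taken from the SGLang source.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample events standing in for a real streamed response body.
sample_events = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Par"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "is"}}]}',
    "",
    "data: [DONE]",
]
streamed_text = "".join(iter_stream_text(sample_events))
print(streamed_text)
```

In real use, the lines would come from iterating over the HTTP response body rather than from a list.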
Code Reference
Source Location
- Repository: sglang
- File: python/sglang/srt/entrypoints/http_server.py
- Lines: L1324-1331 (route handler)
- Request model: python/sglang/srt/entrypoints/openai/protocol.py:L529-627
Signature
# FastAPI route (server-side)
@app.post("/v1/chat/completions")
async def v1_chat_completions(request: ChatCompletionRequest) -> ChatCompletion
# Client-side call signature (types shown for reference; not runnable as written)
response = client.chat.completions.create(
    model: str,
    messages: List[ChatCompletionMessageParam],
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    max_completion_tokens: Optional[int] = None,
    stream: bool = False,
    top_p: Optional[float] = None,
    response_format: Optional[ResponseFormat] = None,
    # SGLang extensions (with the OpenAI SDK, pass these via extra_body={"regex": ...}):
    regex: Optional[str] = None,
)
Import
# Client side (via OpenAI SDK)
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(...)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | Model name. The OpenAI SDK requires it; the server defaults to the served model name |
| messages | List[Dict] | Yes | Conversation messages with "role" and "content" |
| temperature | Optional[float] | No | Sampling temperature |
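| max_tokens | Optional[int] | No | Maximum tokens to generate |
| max_completion_tokens | Optional[int] | No | Maximum tokens to generate; OpenAI's replacement for the deprecated max_tokens |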
| stream | bool | No | Enable streaming (default: False) |
| top_p | Optional[float] | No | Nucleus sampling threshold |
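| response_format | Optional[Dict] | No | Output format specification, e.g. a JSON schema for structured output |
| regex | Optional[str] | No | SGLang extension: regular expression for constrained decoding |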
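To make the response_format input more concrete, here is a minimal sketch of a request body that asks for schema-constrained JSON. It assumes the OpenAI-style "json_schema" shape for response_format. The helper json_schema_response_format, the schema, and the schema name "capital_info" are made up for illustration.

```python
import json
from typing import Any, Dict


def json_schema_response_format(name: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: wrap a JSON schema in the OpenAI response_format shape."""
    return {"type": "json_schema", "json_schema": {"name": name, "schema": schema}}


# Illustrative schema describing the JSON object the model should produce.
capital_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

request_body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Give the capital of France as JSON."},
    ],
    "temperature": 0,
    "max_tokens": 128,
    "response_format": json_schema_response_format("capital_info", capital_schema),
}
print(json.dumps(request_body["response_format"], indent=2))
```

The same dictionary can be passed as the response_format argument of client.chat.completions.create.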
Outputs
| Name | Type | Description |
|---|---|---|
| ChatCompletion | JSON | Response with choices[0].message.content, usage stats |
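The sketch below shows one way to read the fields named in the table from a response that has already been decoded from JSON. The sample dictionary is hand-written to match the OpenAI ChatCompletion layout and is not real server output; extract_reply is a hypothetical helper.

```python
from typing import Any, Dict, Optional, Tuple


def extract_reply(completion: Dict[str, Any]) -> Tuple[str, Optional[str], int]:
    """Return (content, finish_reason, total_tokens) from a ChatCompletion dict."""
    choice = completion["choices"][0]
    # content can be null in the schema (e.g. tool calls), so fall back to "".
    content = choice["message"].get("content") or ""
    finish_reason = choice.get("finish_reason")
    total_tokens = (completion.get("usage") or {}).get("total_tokens", 0)
    return content, finish_reason, total_tokens


# Hand-written sample in the OpenAI ChatCompletion shape.
sample_completion = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 24, "completion_tokens": 3, "total_tokens": 27},
}

content, finish_reason, total_tokens = extract_reply(sample_completion)
print(content, finish_reason, total_tokens)
```

With the OpenAI SDK the same values are available as attributes, for example response.choices[0].message.content and response.usage.total_tokens.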
Usage Examples
Basic Chat
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
Multi-Turn Conversation
messages = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is calculus?"},
    {"role": "assistant", "content": "Calculus is the study of continuous change..."},
    {"role": "user", "content": "Can you give me an example?"},
]
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    temperature=0.7,
    max_tokens=256,
)
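Because each request carries the full conversation, the client is responsible for appending every assistant reply before sending the next user turn. A minimal sketch of that bookkeeping follows; extend_history is a hypothetical helper, and the reply text is hard-coded where a real client would use response.choices[0].message.content.

```python
from typing import Dict, List

Message = Dict[str, str]


def extend_history(
    history: List[Message], assistant_reply: str, next_user_turn: str
) -> List[Message]:
    """Return a new message list with the reply and the next user turn appended."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_turn},
    ]


history: List[Message] = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is calculus?"},
]

# Placeholder for the text returned by the previous request.
previous_reply = "Calculus is the study of continuous change..."
history = extend_history(history, previous_reply, "Can you give me an example?")

for message in history:
    print(message["role"], "->", message["content"])
```

The extended list can then be passed as the messages argument of the next request.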