Implementation:Sgl project Sglang V1 Chat Completions

Knowledge Sources	SGLang
Domains	LLM_Serving, API_Design, Chat
Last Updated	2026-02-10 00:00 GMT

Overview

Concrete tool, provided by the SGLang HTTP server, for processing OpenAI-compatible chat completion requests.

Description

The /v1/chat/completions endpoint is a FastAPI route. It accepts a ChatCompletionRequest (validated via Pydantic), applies the model's chat template to format the conversation into a prompt, and passes the result to the SGLang engine for generation. The response follows the OpenAI ChatCompletion schema. Beyond the standard parameters, which include response_format for JSON schema enforcement, SGLang accepts extensions such as regex for constrained decoding.
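To make the chat-template step concrete, here is a minimal sketch of how a message list might be flattened into a single prompt string. The render_chatml function and the ChatML-style markers are illustrative assumptions, not SGLang code; the server uses the template that ships with the served model (typically a Jinja template in the tokenizer configuration), so the real prompt layout differs per model.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Flatten chat messages into a ChatML-style prompt (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers tagged with its role.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ]
)
print(prompt)
```

The point of the sketch is only that role-tagged messages become one token sequence before generation; clients never build this string themselves when using the endpoint.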

Usage

Send HTTP POST requests to /v1/chat/completions on a running SGLang server. Use the OpenAI Python SDK or any HTTP client. This endpoint handles both streaming and non-streaming responses.
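For streaming, OpenAI-compatible servers send server-sent events: each event is a line of the form data: {json chunk}, and the stream ends with data: [DONE]. The sketch below shows how a raw HTTP client might reassemble the text from such lines. The collect_stream_text helper and the sample lines are hypothetical and hand-written for illustration, and the sketch assumes a single choice per request.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from OpenAI-style SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Illustrative sample, not captured server output.
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Par"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "is"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample_lines))
```

With the OpenAI Python SDK this parsing is done for you: passing stream=True returns an iterator of chunk objects whose choices[0].delta.content can be read directly.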

Code Reference

Source Location

Repository: sglang
File: python/sglang/srt/entrypoints/http_server.py
Lines: L1324-1331 (route handler)
Request model: python/sglang/srt/entrypoints/openai/protocol.py:L529-627

Signature

# FastAPI route (server-side)
@app.post("/v1/chat/completions")
async def v1_chat_completions(request: ChatCompletionRequest) -> ChatCompletion

# Client-side usage (parameter summary, not executable as written)
response = client.chat.completions.create(
    model: str,
    messages: List[ChatCompletionMessageParam],
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    max_completion_tokens: Optional[int] = None,
    stream: bool = False,
    top_p: Optional[float] = None,
    response_format: Optional[ResponseFormat] = None,
    # SGLang extensions such as regex are request-body fields, not keyword
    # arguments of the OpenAI SDK; pass them via extra_body={"regex": ...}:
    extra_body: Optional[Dict[str, Any]] = None,
)
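When calling through the OpenAI Python SDK, fields outside the standard schema, such as regex, are usually supplied through the SDK's extra_body argument, which is merged into the JSON request body. The following sketch assembles such a call. CITY_PATTERN and satisfies_pattern are hypothetical names introduced for illustration, and the actual request is left commented out because it needs a running server.

```python
import re

# Hypothetical pattern: restrict the answer to one of three city names.
CITY_PATTERN = r"(Paris|London|Berlin)"

request_kwargs = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0,
    "max_tokens": 16,
    # Non-standard fields travel in extra_body, which the SDK merges into the JSON body.
    "extra_body": {"regex": CITY_PATTERN},
}

# With a running server and the client from the Import section:
# response = client.chat.completions.create(**request_kwargs)


def satisfies_pattern(text: str, pattern: str = CITY_PATTERN) -> bool:
    """Client-side sanity check that a reply fully matches the requested pattern."""
    return re.fullmatch(pattern, text) is not None
```

The client-side check is optional; it is included only to show what "constrained" means for the returned text.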

Import

# Client side (via OpenAI SDK)
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(...)

I/O Contract

Inputs

Name	Type	Required	Description
model	str	Yes	Model name (default: served model name)
messages	List[Dict]	Yes	Conversation messages with "role" and "content"
temperature	Optional[float]	No	Sampling temperature
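max_tokens	Optional[int]	No	Maximum tokens to generate
max_completion_tokens	Optional[int]	No	Maximum tokens to generate (newer OpenAI name for max_tokens)
stream	bool	No	Enable streaming (default: False)
top_p	Optional[float]	No	Nucleus sampling threshold
response_format	Optional[Dict]	No	JSON schema for structured output
regex	Optional[str]	No	Regular expression for constrained decoding (SGLang extension)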
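As an illustration of the response_format input, the sketch below builds a request that asks for output conforming to a JSON schema, using the OpenAI json_schema form. The schema contents and the name capital_info are made up for this example; the network call is commented out because it requires a running server.

```python
import json

# Hypothetical schema for illustration.
capital_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

request_kwargs = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Describe the capital of France as JSON."},
    ],
    "temperature": 0,
    "max_tokens": 128,
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "capital_info", "schema": capital_schema},
    },
}

# The request must be JSON-serializable, since it is sent as the POST body.
serialized = json.dumps(request_kwargs)

# With a running server and the client from the Import section:
# response = client.chat.completions.create(**request_kwargs)
# info = json.loads(response.choices[0].message.content)
```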

Outputs

Name	Type	Description
ChatCompletion	JSON	Response with choices[0].message.content, usage stats
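The output fields can be read from the raw JSON as follows. The response body here is hand-written to match the OpenAI ChatCompletion shape; every value in it, including the token counts, is illustrative rather than real server output.

```python
import json

# Hand-written example body in the OpenAI ChatCompletion shape; all values are illustrative.
raw_body = """
{
  "id": "example-id",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "The capital of France is Paris."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33}
}
"""

body = json.loads(raw_body)
first_choice = body["choices"][0]
answer = first_choice["message"]["content"]
# A finish_reason of "length" would indicate the reply hit the token limit.
truncated = first_choice["finish_reason"] == "length"
usage = body["usage"]
print(answer, truncated, usage["total_tokens"])
```

When using the OpenAI SDK, the same fields are available as attributes: response.choices[0].message.content, response.choices[0].finish_reason, and response.usage.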

Usage Examples

Basic Chat

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)

Multi-Turn Conversation

messages = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is calculus?"},
    {"role": "assistant", "content": "Calculus is the study of continuous change..."},
    {"role": "user", "content": "Can you give me an example?"},
]
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
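Each request carries the full message list, so continuing a conversation means appending the assistant's reply and the next user turn before calling the endpoint again. A minimal sketch of that bookkeeping follows; next_turn is a hypothetical helper, not part of SGLang or the OpenAI SDK.

```python
from typing import Dict, List

Message = Dict[str, str]


def next_turn(history: List[Message], assistant_reply: str, user_message: str) -> List[Message]:
    """Return a new history with the assistant reply and the next user turn appended."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_message},
    ]


first_history = [{"role": "user", "content": "What is calculus?"}]
second_history = next_turn(
    first_history,
    "Calculus is the study of continuous change...",
    "Can you give me an example?",
)
```

In practice assistant_reply would be response.choices[0].message.content from the previous call. Returning a new list rather than mutating the input keeps earlier histories intact, which is convenient when retrying a turn.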

Related Pages

Implements Principle

Principle:Sgl_project_Sglang_Chat_Completion_API
