Implementation:Sgl project Sglang V1 Chat Completions

From Leeroopedia


Knowledge Sources
Domains LLM_Serving, API_Design, Chat
Last Updated 2026-02-10 00:00 GMT

Overview

Concrete tool, provided by the SGLang HTTP server, for processing OpenAI-compatible chat completion requests.

Description

The /v1/chat/completions endpoint is a FastAPI route that accepts ChatCompletionRequest objects (validated via Pydantic), applies the chat template to format the conversation, and routes the request through the SGLang engine for generation. The response follows the OpenAI ChatCompletion schema. Beyond the standard parameters, which include response_format for JSON schema enforcement, SGLang adds extensions such as regex for constrained decoding.
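To make the chat-template step concrete, here is a minimal sketch of how a list of messages can be flattened into a single prompt string. The function name apply_toy_chat_template and the <|role|> markers are hypothetical and chosen only for illustration; real templates are model-specific (they typically ship with the model's tokenizer), and this is not SGLang's implementation.

```python
from typing import Dict, List


def apply_toy_chat_template(
    messages: List[Dict[str, str]],
    add_generation_prompt: bool = True,
) -> str:
    """Flatten chat messages into one prompt string.

    Hypothetical format: each message becomes "<|role|>\\ncontent\\n".
    Real chat templates differ per model.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


if __name__ == "__main__":
    print(
        apply_toy_chat_template(
            [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is the capital of France?"},
            ]
        )
    )
```

The point of the sketch is only that the server, not the client, turns the structured messages list into the text the model actually sees.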

Usage

Send HTTP POST requests to /v1/chat/completions on a running SGLang server. Use the OpenAI Python SDK or any HTTP client. This endpoint handles both streaming and non-streaming responses.
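For clients that do not use the OpenAI SDK, the following sketch builds the POST request with the Python standard library and extracts text from a streamed response. It assumes the server streams in the OpenAI server-sent-events style ("data: {...}" lines ending with "data: [DONE]") and listens on localhost:30000 as elsewhere on this page; the helper names build_chat_request, iter_stream_text, and stream_chat are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_chat_request(server_url: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions (nothing is sent yet)."""
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text pieces from OpenAI-style streaming lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything else
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


def stream_chat(server_url: str, payload: dict) -> str:
    """Send a streaming request and return the concatenated text.

    Requires a running server; not called automatically.
    """
    request = build_chat_request(server_url, {**payload, "stream": True})
    with urllib.request.urlopen(request) as response:
        return "".join(iter_stream_text(response))
```

With a server running, stream_chat("http://localhost:30000", {"model": ..., "messages": [...]}) would return the full reply; for a non-streaming call, send the same payload without "stream" and read the JSON body once.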

Code Reference

Source Location

  • Repository: sglang
  • File: python/sglang/srt/entrypoints/http_server.py
  • Lines: L1324-1331 (route handler)
  • Request model: python/sglang/srt/entrypoints/openai/protocol.py:L529-627

Signature

# FastAPI route (server-side)
@app.post("/v1/chat/completions")
async def v1_chat_completions(request: ChatCompletionRequest) -> ChatCompletion

# Client-side usage (schematic: parameter types and defaults, not runnable code)
response = client.chat.completions.create(
    model: str,
    messages: List[ChatCompletionMessageParam],
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    max_completion_tokens: Optional[int] = None,
    stream: bool = False,
    top_p: Optional[float] = None,
    response_format: Optional[ResponseFormat] = None,
    # SGLang extensions (not named arguments of the OpenAI SDK;
    # pass them as extra_body={"regex": ...}):
    regex: Optional[str] = None,
)

Import

# Client side (via OpenAI SDK)
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(...)
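The regex extension is not a named argument of the OpenAI SDK, so one way to send it is through the SDK's extra_body parameter, which merges extra fields into the request JSON. The sketch below only assembles the keyword arguments; the helper names regex_chat_kwargs and matches_pattern are hypothetical, and the local re.fullmatch check is a client-side sanity check, not part of SGLang.

```python
import re


def regex_chat_kwargs(model: str, prompt: str, pattern: str, max_tokens: int = 32) -> dict:
    """Keyword arguments for client.chat.completions.create with a regex constraint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
        # SGLang-specific field, forwarded verbatim in the request body.
        "extra_body": {"regex": pattern},
    }


def matches_pattern(text: str, pattern: str) -> bool:
    """Check locally that the whole reply conforms to the pattern."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    kwargs = regex_chat_kwargs(
        "meta-llama/Llama-3.1-8B-Instruct",
        "What is the capital of France?",
        r"(Paris|London)",
    )
    print(kwargs)
    # With a running server and the client from above:
    # response = client.chat.completions.create(**kwargs)
    # assert matches_pattern(response.choices[0].message.content, r"(Paris|London)")
```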

I/O Contract

Inputs

Name             Type             Required  Description
model            str              Yes       Model name (default: served model name)
messages         List[Dict]       Yes       Conversation messages with "role" and "content"
temperature      Optional[float]  No        Sampling temperature
max_tokens       Optional[int]    No        Maximum tokens to generate
stream           bool             No        Enable streaming (default: False)
top_p            Optional[float]  No        Nucleus sampling threshold
response_format  Optional[Dict]   No        JSON schema for structured output
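As an illustration of the response_format input, the sketch below builds a value in the OpenAI json_schema shape and checks a reply against the schema's required keys. The schema, the name "capital_info", and the helper names are hypothetical examples; whether a given server version accepts every JSON Schema feature is not covered here.

```python
import json
from typing import Iterable


def json_schema_response_format(name: str, schema: dict) -> dict:
    """Build a response_format value in the OpenAI json_schema shape."""
    return {"type": "json_schema", "json_schema": {"name": name, "schema": schema}}


# Hypothetical schema used only for this example.
CAPITAL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}


def parse_structured_reply(content: str, required_keys: Iterable[str]) -> dict:
    """Parse the reply as JSON and verify that the required keys are present."""
    data = json.loads(content)
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys in structured reply: {missing}")
    return data


if __name__ == "__main__":
    response_format = json_schema_response_format("capital_info", CAPITAL_SCHEMA)
    print(json.dumps(response_format, indent=2))
    # With a running server:
    # response = client.chat.completions.create(
    #     model=..., messages=[...], response_format=response_format)
    # info = parse_structured_reply(
    #     response.choices[0].message.content, CAPITAL_SCHEMA["required"])
```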

Outputs

Name            Type  Description
ChatCompletion  JSON  Response with choices[0].message.content, usage stats
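The fields named in the output table can be read from the decoded JSON body as sketched below. The helper summarize_completion is hypothetical, and SAMPLE_COMPLETION is a hand-written illustration of the OpenAI ChatCompletion shape, not captured server output.

```python
def summarize_completion(completion: dict) -> dict:
    """Extract the reply text and token usage from a ChatCompletion dict."""
    choice = completion["choices"][0]
    usage = completion.get("usage") or {}
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Hand-written sample for illustration only.
SAMPLE_COMPLETION = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 20, "completion_tokens": 3, "total_tokens": 23},
}

if __name__ == "__main__":
    print(summarize_completion(SAMPLE_COMPLETION))
```

When using the OpenAI SDK, the same fields are available as attributes (response.choices[0].message.content, response.usage.total_tokens).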

Usage Examples

Basic Chat

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)

Multi-Turn Conversation

messages = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is calculus?"},
    {"role": "assistant", "content": "Calculus is the study of continuous change..."},
    {"role": "user", "content": "Can you give me an example?"},
]
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    temperature=0.7,
    max_tokens=256,
)
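Because the endpoint is stateless from the client's point of view, a conversation continues by resending the whole history with the latest assistant reply appended. A minimal sketch of that bookkeeping follows; the helper name next_turn is hypothetical.

```python
from typing import Dict, List

Message = Dict[str, str]


def next_turn(messages: List[Message], assistant_reply: str, user_message: str) -> List[Message]:
    """Return a new history with the assistant reply and the next user message appended."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_message},
    ]


if __name__ == "__main__":
    history = [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is calculus?"},
    ]
    # In practice assistant_reply would be response.choices[0].message.content.
    history = next_turn(
        history,
        "Calculus is the study of continuous change...",
        "Can you give me an example?",
    )
    for message in history:
        print(f"{message['role']}: {message['content']}")
```

The function returns a new list instead of mutating its input, so earlier histories can be kept for retries or branching conversations.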

Related Pages

Implements Principle
