Implementation:Sgl project Sglang Speculative Chat

Knowledge Sources	Sgl_project_Sglang
Domains	Inference, API Integration
Last Updated	2026-02-10 00:00 GMT

Overview

Demonstrates speculative execution for OpenAI chat models, where multiple generation calls within a single assistant turn are speculatively batched to reduce API round trips.

Description

openai_chat_speculative.py showcases SGLang's speculative execution feature with the OpenAI API backend. By using @function(num_api_spec_tokens=256), multiple sgl.gen calls within a single sgl.assistant(...) block are speculatively batched, reducing latency and API costs when extracting multiple structured fields.
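The batching idea can be illustrated without calling any API. The following is a minimal sketch, not SGLang's actual implementation: it assumes the backend has already returned one speculated completion, and it fills each named field locally by matching the literal template text and cutting at each field's stop string. The names fill_from_speculation and CHARACTER_SEGMENTS are hypothetical and exist only for this illustration.

```python
# Hypothetical template mirroring:
#   "Name:" + gen("name", stop="\n") + "\nBirthday:" + gen("birthday", stop="\n") + ...
# Each entry is (literal_prefix, field_name, stop_string).
CHARACTER_SEGMENTS = [
    ("Name:", "name", "\n"),
    ("\nBirthday:", "birthday", "\n"),
    ("\nJob:", "job", "\n"),
]


def fill_from_speculation(spec_text, segments):
    """Fill named fields from one speculated completion.

    Returns (fields, ok). ok is False when the speculated text diverges
    from the template or a value is cut off before its stop string; in a
    real system that would be the point where a new request is needed.
    """
    fields = {}
    pos = 0
    for prefix, name, stop in segments:
        # The speculated text must continue with the literal template text.
        if not spec_text.startswith(prefix, pos):
            return fields, False
        pos += len(prefix)
        end = spec_text.find(stop, pos)
        if end == -1:
            # No stop string found: the value is incomplete.
            return fields, False
        fields[name] = spec_text[pos:end]
        # Do not consume the stop string; the next prefix begins with it.
        pos = end
    return fields, True
```

In this sketch a single response such as "Name: Ada Lovelace\nBirthday: December 10, 1815\nJob: Mathematician\n" yields all three fields with no further requests, while a response that does not start with "Name:" is reported as a miss.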

The key constraint is that all sgl.gen calls must sit inside a single sgl.assistant() block, and each sgl.gen call must specify a stop token. The script defines the following functions and tests:

gen_character_spec: Speculative single-turn with few-shot examples for accurate speculation
gen_character_spec_no_few_shot: Speculative without few-shot (demonstrates less accurate speculation)
gen_character_normal: Normal non-speculative generation for comparison
multi_turn_question: Multi-turn speculative with structured Q&A format
test_spec_multi_turn_stream: Streaming test (noted as unsupported with speculation)

The script uses GPT-4-turbo as the backend and tracks token usage to demonstrate efficiency gains.
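To make the efficiency argument concrete, here is a toy accounting model. It is a sketch under stated assumptions, not a measurement and not how the script's backend reports usage: it compares a hypothetical one-request-per-field strategy, where every request resends the growing context, against a single speculative request. The Usage class, both functions, and all token counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical usage counter (not the SGLang token usage object)."""
    calls: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, prompt, completion):
        self.calls += 1
        self.prompt_tokens += prompt
        self.completion_tokens += completion


def sequential_usage(prompt_len, segments):
    """One request per field; each request resends everything written so far.

    segments: list of (prefix_tokens, value_tokens) pairs.
    """
    usage = Usage()
    context = prompt_len
    for prefix_tokens, value_tokens in segments:
        context += prefix_tokens
        usage.record(prompt=context, completion=value_tokens)
        context += value_tokens
    return usage


def speculative_usage(prompt_len, segments, overshoot=0):
    """One request covers every field.

    overshoot models extra tokens generated past the last field, which is
    the price of asking for a fixed speculative budget up front.
    """
    usage = Usage()
    total = sum(p + v for p, v in segments) + overshoot
    usage.record(prompt=prompt_len, completion=total)
    return usage


if __name__ == "__main__":
    segs = [(2, 4), (3, 6), (2, 3)]  # assumed token counts per field
    print(sequential_usage(50, segs))
    print(speculative_usage(50, segs, overshoot=5))
```

Under these assumptions the speculative path uses one request and sends the prompt once, at the cost of possibly generating some tokens that are never used. When speculation misses, as the no-few-shot test case is meant to show, that advantage shrinks.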

Usage

Use this example when building applications that extract multiple structured fields from a single LLM response using OpenAI models. Speculative execution reduces the number of API calls by batching consecutive generation calls into fewer requests.

Code Reference

Source Location

Repository: Sgl_project_Sglang
File: examples/frontend_language/usage/openai_chat_speculative.py
Lines: 1-155

Signature

@function(num_api_spec_tokens=256)
def gen_character_spec(s): ...

@function(num_api_spec_tokens=256)
def gen_character_spec_no_few_shot(s): ...

@function
def gen_character_normal(s): ...

@function(num_api_spec_tokens=1024)
def multi_turn_question(s, question_1, question_2): ...

def test_spec_single_turn(): ...
def test_inaccurate_spec_single_turn(): ...
def test_normal_single_turn(): ...
def test_spec_multi_turn(): ...
def test_spec_multi_turn_stream(): ...

Import

import sglang as sgl
from sglang import OpenAI, function, set_default_backend

I/O Contract

Inputs

Name	Type	Required	Description
OPENAI_API_KEY	env var	Yes	OpenAI API key for authentication
question_1	str	Yes (for multi_turn_question)	First question in multi-turn format
question_2	str	Yes (for multi_turn_question)	Second question in multi-turn format
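Because OPENAI_API_KEY is a required input, it can help to fail early with a clear message before constructing the backend. The helper below is a hypothetical convenience, not part of the example script; it only reads the environment variable named in the table above.

```python
import os


def require_openai_key(env=None):
    """Return the API key, or raise if OPENAI_API_KEY is missing or blank.

    env defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the example."
        )
    return key
```

Calling require_openai_key() at the top of a script surfaces a missing key immediately instead of at the first API request.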

Outputs

Name	Type	Description
name	str	Generated character name field
birthday	str	Generated character birthday field
job	str	Generated character job field
answer_1	str	Answer to first question (multi-turn)
answer_2	str	Answer to second question (multi-turn)

Usage Examples

import sglang as sgl
from sglang import OpenAI, function, set_default_backend

backend = OpenAI("gpt-4-turbo")
set_default_backend(backend)

# Speculative execution: all gen calls in a single assistant block
@function(num_api_spec_tokens=256)
def gen_character_spec(s):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user("Construct a character within the following format:")
    s += sgl.assistant(
        "Name: Steve Jobs.\nBirthday: February 24, 1955.\nJob: Apple CEO.\n"
    )
    s += sgl.user("Please generate new Name, Birthday and Job.\n")
    s += sgl.assistant(
        "Name:" + sgl.gen("name", stop="\n")
        + "\nBirthday:" + sgl.gen("birthday", stop="\n")
        + "\nJob:" + sgl.gen("job", stop="\n")
    )

state = gen_character_spec.run()
print("Name:", state["name"])
print("Birthday:", state["birthday"])
print("Job:", state["job"])

Related Pages

Environment:Sgl_project_Sglang_OpenAI
