Implementation:Sgl project Sglang Speculative Chat
| Knowledge Sources | |
|---|---|
| Domains | Inference, API Integration |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Demonstrates speculative execution for OpenAI chat models, where multiple generation calls within a single assistant turn are speculatively batched to reduce API round trips.
Description
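openai_chat_speculative.py showcases SGLang's speculative execution feature with the OpenAI API backend. When a function is decorated with @function(num_api_spec_tokens=256), the sgl.gen calls inside a single sgl.assistant(...) block are speculatively batched: one API request generates ahead, up to the given token budget, and the later sgl.gen calls are filled from that text when it follows the expected format. This reduces latency and API cost when extracting multiple structured fields.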
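The idea can be illustrated without any network access. The following is a minimal sketch, not SGLang's actual implementation: `run_with_speculation`, `scripted_api`, and `TEMPLATE` are hypothetical names introduced here, and the fallback of issuing another call when the speculated text does not match is an assumption of the sketch.

```python
from typing import Callable, Dict, List, Tuple, Union

# A template mixes literal strings with ("gen", name, stop) slots, mirroring
# "Name:" + gen("name", stop="\n") + "\nBirthday:" + gen("birthday", stop="\n") ...
Slot = Tuple[str, str, str]
Template = List[Union[str, Slot]]

TEMPLATE: Template = [
    "Name:", ("gen", "name", "\n"),
    "\nBirthday:", ("gen", "birthday", "\n"),
    "\nJob:", ("gen", "job", "\n"),
]


def scripted_api(replies: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an API call: returns canned replies in order."""
    queue = list(replies)

    def complete(prefix: str) -> str:
        return queue.pop(0)

    return complete


def run_with_speculation(
    template: Template, complete: Callable[[str], str]
) -> Tuple[Dict[str, str], int]:
    """Fill every gen slot, reusing one long completion where it matches."""
    fields: Dict[str, str] = {}
    written = ""     # assistant text committed so far
    speculated = ""  # unconsumed text from the most recent completion
    calls = 0

    for part in template:
        if isinstance(part, str):
            # Literal text: keep the buffered speculation only if it agrees.
            if speculated.startswith(part):
                speculated = speculated[len(part):]
            else:
                speculated = ""
            written += part
            continue

        _, name, stop = part
        if stop not in speculated:
            # Nothing usable is buffered, so a real call is needed.
            speculated = complete(written)
            calls += 1
        value, found, rest = speculated.partition(stop)
        fields[name] = value
        written += value
        # Keep the stop token buffered: the next literal may begin with it.
        speculated = stop + rest if found else ""

    return fields, calls
```

When the first reply already follows the `Name / Birthday / Job` layout, all three slots are filled from one call. When the reply drifts to another layout, each slot in this sketch costs its own call, which is why the few-shot variant speculates more accurately.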
The key constraint is that all sgl.gen calls must be within a single sgl.assistant() block with stop tokens on each generation. The script provides several test cases:
- gen_character_spec: Speculative single-turn with few-shot examples for accurate speculation
- gen_character_spec_no_few_shot: Speculative without few-shot (demonstrates less accurate speculation)
- gen_character_normal: Normal non-speculative generation for comparison
- multi_turn_question: Multi-turn speculative with structured Q&A format
- test_spec_multi_turn_stream: Streaming test (noted as unsupported with speculation)
The script uses GPT-4-turbo as the backend and tracks token usage to demonstrate efficiency gains.
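The efficiency gain can be made concrete with rough arithmetic. The sketch below is a simplified model with hypothetical token counts, not measurements from the script: it assumes that each non-speculative request re-sends the prompt plus the fields generated so far, it ignores the literal text between fields, and it assumes the speculative request matches on the first try.

```python
from typing import List, Tuple


def tokens_separate_calls(prompt: int, fields: List[int]) -> Tuple[int, int]:
    """One request per field; each re-sends the prompt plus earlier fields.

    Returns (total prompt tokens sent, total completion tokens).
    """
    prompt_total = 0
    completion_total = 0
    sent_so_far = prompt
    for field_tokens in fields:
        prompt_total += sent_so_far
        completion_total += field_tokens
        sent_so_far += field_tokens
    return prompt_total, completion_total


def tokens_single_speculative_call(prompt: int, fields: List[int]) -> Tuple[int, int]:
    """One request produces every field, so the prompt is sent once."""
    return prompt, sum(fields)


# Hypothetical sizes: a 60-token prompt and fields of 4, 6, and 3 tokens.
separate = tokens_separate_calls(60, [4, 6, 3])
speculative = tokens_single_speculative_call(60, [4, 6, 3])
```

In this model the completion tokens are identical and the saving comes entirely from not re-sending the prompt. In practice a speculative request may also generate tokens that are never used, so the real difference can be smaller.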
Usage
Use this example when building applications that extract multiple structured fields from a single LLM response using OpenAI models. Speculative execution reduces the number of API calls by serving the consecutive sgl.gen calls of one assistant turn from a single speculated response.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: examples/frontend_language/usage/openai_chat_speculative.py
- Lines: 1-155
Signature
@function(num_api_spec_tokens=256)
def gen_character_spec(s): ...
@function(num_api_spec_tokens=256)
def gen_character_spec_no_few_shot(s): ...
@function
def gen_character_normal(s): ...
@function(num_api_spec_tokens=1024)
def multi_turn_question(s, question_1, question_2): ...
def test_spec_single_turn(): ...
def test_inaccurate_spec_single_turn(): ...
def test_normal_single_turn(): ...
def test_spec_multi_turn(): ...
def test_spec_multi_turn_stream(): ...
Import
import sglang as sgl
from sglang import OpenAI, function, set_default_backend
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| OPENAI_API_KEY | env var | Yes | OpenAI API key for authentication |
| question_1 | str | Yes (for multi_turn_question) | First question in multi-turn format |
| question_2 | str | Yes (for multi_turn_question) | Second question in multi-turn format |
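Because OPENAI_API_KEY is required, it can help to fail early with a clear message before any request is made. A minimal sketch follows; `require_openai_key` is a hypothetical helper, not part of SGLang or the example script.

```python
import os
from typing import Mapping, Optional


def require_openai_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key, or raise if OPENAI_API_KEY is missing or blank."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the example."
        )
    return key
```

Calling `require_openai_key()` at the top of a script reads the real environment, and passing a plain dict makes the check easy to exercise in tests.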
Outputs
| Name | Type | Description |
|---|---|---|
| name | str | Generated character name field |
| birthday | str | Generated character birthday field |
| job | str | Generated character job field |
| answer_1 | str | Answer to first question (multi-turn) |
| answer_2 | str | Answer to second question (multi-turn) |
Usage Examples
import sglang as sgl
from sglang import OpenAI, function, set_default_backend
backend = OpenAI("gpt-4-turbo")
set_default_backend(backend)
# Speculative execution: all gen calls in a single assistant block
@function(num_api_spec_tokens=256)
def gen_character_spec(s):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user("Construct a character within the following format:")
    s += sgl.assistant(
        "Name: Steve Jobs.\nBirthday: February 24, 1955.\nJob: Apple CEO.\n"
    )
    s += sgl.user("Please generate new Name, Birthday and Job.\n")
    s += sgl.assistant(
        "Name:" + sgl.gen("name", stop="\n")
        + "\nBirthday:" + sgl.gen("birthday", stop="\n")
        + "\nJob:" + sgl.gen("job", stop="\n")
    )
state = gen_character_spec.run()
print("Name:", state["name"])
print("Birthday:", state["birthday"])
print("Job:", state["job"])
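Because each field is generated directly after a literal such as "Name:", the returned strings may begin with a space. A small post-processing step, sketched here with the hypothetical helper `collect_fields`, gathers the three fields into a plain dict and trims stray whitespace. It assumes only that the state supports lookup by field name, as in the example above.

```python
from typing import Dict, Mapping

FIELD_NAMES = ("name", "birthday", "job")


def collect_fields(state: Mapping[str, str]) -> Dict[str, str]:
    """Copy the generated fields into a dict, trimming surrounding whitespace."""
    return {key: state[key].strip() for key in FIELD_NAMES}
```

With the example above, `collect_fields(state)` would be called after `gen_character_spec.run()` returns.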