Implementation:Sgl project Sglang Speculative Chat

From Leeroopedia


Knowledge Sources
Domains Inference, API Integration
Last Updated 2026-02-10 00:00 GMT

Overview

Demonstrates speculative execution for OpenAI chat models, where multiple generation calls within a single assistant turn are speculatively batched to reduce API round trips.

Description

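openai_chat_speculative.py showcases SGLang's speculative execution feature with the OpenAI API backend. Decorating a program with @function(num_api_spec_tokens=256) lets the backend request up to 256 tokens ahead in one API call and then fill the multiple sgl.gen calls inside a single sgl.assistant(...) block from that completion. When several structured fields are extracted from one assistant turn, this cuts the number of API round trips and therefore latency and cost.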
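To make the idea concrete, the following is a minimal pure-Python sketch of how one over-generated completion could be matched against a sequence of literal prefixes and stop strings. It is not SGLang's internal code: the helper fill_from_speculation, the Template type, and both sample completions are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Each step is (literal_prefix, field_name, stop_string), mirroring
# "Name:" + sgl.gen("name", stop="\n") in the example program.
Template = List[Tuple[str, str, str]]


def fill_from_speculation(speculated: str, template: Template) -> Optional[Dict[str, str]]:
    """Try to serve every field from one over-generated completion.

    Returns the extracted fields, or None as soon as the speculated text
    stops matching the template (a client would then need another request).
    """
    fields: Dict[str, str] = {}
    pos = 0
    for prefix, name, stop in template:
        # The literal text written by the program must come next.
        if not speculated.startswith(prefix, pos):
            return None
        pos += len(prefix)
        # The field runs up to the stop string, or to the end of the text.
        end = speculated.find(stop, pos)
        if end == -1:
            end = len(speculated)
        fields[name] = speculated[pos:end]
        pos = end
    return fields


template: Template = [
    ("Name:", "name", "\n"),
    ("\nBirthday:", "birthday", "\n"),
    ("\nJob:", "job", "\n"),
]

# Hypothetical completion that follows the demonstrated format.
good = "Name: Ada Lovelace.\nBirthday: December 10, 1815.\nJob: Mathematician.\n"
# Hypothetical completion that drifts from the format.
bad = "Sure! Here is a character:\nName: Ada Lovelace\n"

print(fill_from_speculation(good, template))
print(fill_from_speculation(bad, template))
```

In this sketch the first call yields all three fields from a single completion, while the second returns None because the text does not begin with the expected "Name:" prefix. That is the intuition behind showing the desired format in a few-shot assistant message: the closer the model stays to the template, the more often the speculation can be reused.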

The key constraint is that all sgl.gen calls must sit inside a single sgl.assistant() block, and each sgl.gen call must specify a stop string (stop="\n" in this script). The script provides several test cases:

  • gen_character_spec: Speculative single-turn with few-shot examples for accurate speculation
  • gen_character_spec_no_few_shot: Speculative without few-shot (demonstrates less accurate speculation)
  • gen_character_normal: Normal non-speculative generation for comparison
  • multi_turn_question: Multi-turn speculative with structured Q&A format
  • test_spec_multi_turn_stream: Streaming test (noted as unsupported with speculation)

The script uses GPT-4-turbo as the backend and tracks token usage to demonstrate efficiency gains.

Usage

Use this example when building applications that extract multiple structured fields from a single LLM response using OpenAI models. Speculative execution reduces the number of API calls by serving consecutive generation calls in one assistant turn from a single speculated completion.
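As a rough illustration of why fewer round trips matter, the sketch below compares filling three fields through separate requests, each resending the growing context, with one request that returns all fields. The function prompt_cost and every token count are made up for the arithmetic; they are not measurements from the script, and the model ignores completion tokens, which can grow if the model over-generates.

```python
from typing import List, Tuple


def prompt_cost(prompt_tokens: int, field_tokens: List[int], single_request: bool) -> Tuple[int, int]:
    """Return (round_trips, prompt_tokens_sent) needed to fill all fields."""
    if single_request:
        # One request carries the prompt once; all fields come back together.
        return 1, prompt_tokens
    round_trips = 0
    sent = 0
    context = prompt_tokens
    for n in field_tokens:
        round_trips += 1
        sent += context   # the whole context so far is sent again
        context += n      # the new field becomes part of the next context
    return round_trips, sent


# Made-up sizes: a 60-token prompt and three 8-token fields.
print(prompt_cost(60, [8, 8, 8], single_request=False))  # (3, 204)
print(prompt_cost(60, [8, 8, 8], single_request=True))   # (1, 60)
```

Under these assumed sizes, the separate-request path sends 60 + 68 + 76 = 204 prompt tokens over three round trips, against 60 prompt tokens over one.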

Code Reference

Source Location

Signature

@function(num_api_spec_tokens=256)
def gen_character_spec(s): ...

@function(num_api_spec_tokens=256)
def gen_character_spec_no_few_shot(s): ...

@function
def gen_character_normal(s): ...

@function(num_api_spec_tokens=1024)
def multi_turn_question(s, question_1, question_2): ...

def test_spec_single_turn(): ...
def test_inaccurate_spec_single_turn(): ...
def test_normal_single_turn(): ...
def test_spec_multi_turn(): ...
def test_spec_multi_turn_stream(): ...

Import

import sglang as sgl
from sglang import OpenAI, function, set_default_backend

I/O Contract

Inputs

Name | Type | Required | Description
OPENAI_API_KEY | env var | Yes | OpenAI API key for authentication
question_1 | str | Yes (for multi_turn_question) | First question in multi-turn format
question_2 | str | Yes (for multi_turn_question) | Second question in multi-turn format
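Because OPENAI_API_KEY is a required input, it can help to fail early with a clear message before any request is made. The helper below is a hypothetical convenience, not part of the script or of SGLang; it only reads the environment variable named in the table above.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI API key, or raise a clear error if it is missing."""
    # Hypothetical helper: only checks presence, it does not validate the key.
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the script."
        )
    return key
```

Calling require_api_key() before constructing the backend would surface a missing key as a readable error instead of an authentication failure on the first request.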

Outputs

Name | Type | Description
name | str | Generated character name field
birthday | str | Generated character birthday field
job | str | Generated character job field
answer_1 | str | Answer to first question (multi-turn)
answer_2 | str | Answer to second question (multi-turn)

Usage Examples

import sglang as sgl
from sglang import OpenAI, function, set_default_backend

backend = OpenAI("gpt-4-turbo")
set_default_backend(backend)

# Speculative execution: all gen calls in a single assistant block
@function(num_api_spec_tokens=256)
def gen_character_spec(s):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user("Construct a character within the following format:")
    s += sgl.assistant(
        "Name: Steve Jobs.\nBirthday: February 24, 1955.\nJob: Apple CEO.\n"
    )
    s += sgl.user("Please generate new Name, Birthday and Job.\n")
    s += sgl.assistant(
        "Name:" + sgl.gen("name", stop="\n")
        + "\nBirthday:" + sgl.gen("birthday", stop="\n")
        + "\nJob:" + sgl.gen("job", stop="\n")
    )

state = gen_character_spec.run()
print("Name:", state["name"])
print("Birthday:", state["birthday"])
print("Job:", state["job"])

Related Pages
