Workflow:Sgl project Sglang Frontend Language Multi Turn Chat

From Leeroopedia


Knowledge Sources
Domains LLM_Inference, Prompt_Engineering, Frontend_DSL
Last Updated 2026-02-09 00:00 GMT

Overview

End-to-end process for building multi-turn conversational AI programs using SGLang's frontend domain-specific language (DSL) with features like branching, constrained generation, and batching.

Description

This workflow covers using SGLang's Python-native frontend language to construct complex generation programs that go beyond simple prompt-in, text-out patterns. The frontend DSL provides primitives for multi-turn conversations (sgl.user, sgl.assistant), parallel generation branches (fork/join), constrained decoding (choices, regex), streaming output, and batch execution. Programs are defined as decorated Python functions that compose these primitives, enabling sophisticated generation logic with automatic prompt caching and efficient execution.

Usage

Execute this workflow when you need to build generation programs with control flow, multi-turn dialogues, branching logic, or constrained outputs that go beyond simple API calls. Common use cases include multi-step reasoning agents, structured information extraction pipelines, tool-use decision making, and parallel hypothesis generation.

Execution Steps

Step 1: Initialize the SGLang Backend

Either start a local Runtime (the model runs in the same process) or connect to a RuntimeEndpoint (an SGLang server that is already running). The backend handles all model execution, while the frontend DSL defines the generation logic.

Key considerations:

  • sgl.Runtime(model_path=...) for local in-process execution
  • sgl.RuntimeEndpoint("http://host:port") for connecting to a running server
  • Register the chosen backend as the default with sgl.set_default_backend(backend)
  • Local runtime manages its own GPU memory and model lifecycle
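
The two backend options above can be sketched as one small helper. This is a minimal sketch, not part of SGLang: make_backend is a hypothetical name, and the sglang import is deferred so that the argument check works even where the package is not installed.

```python
import importlib


def make_backend(endpoint_url=None, model_path=None):
    """Create an SGLang backend and register it as the default.

    Hypothetical helper: pass exactly one of `endpoint_url` (a running
    SGLang server) or `model_path` (a model to load in-process).
    """
    if (endpoint_url is None) == (model_path is None):
        raise ValueError("pass exactly one of endpoint_url or model_path")

    # Deferred import: sglang is only required once a backend is really built.
    sgl = importlib.import_module("sglang")

    if endpoint_url is not None:
        # Remote option: talk to a server that is already running.
        backend = sgl.RuntimeEndpoint(endpoint_url)
    else:
        # Local option: load the model inside this process.
        backend = sgl.Runtime(model_path=model_path)

    # Every later .run() / .run_batch() call uses this backend by default.
    sgl.set_default_backend(backend)
    return backend
```

A call such as make_backend(endpoint_url="http://localhost:30000") assumes a server is listening at that address; the host and port are placeholders. Because a local runtime owns its GPU memory, scripts that create one typically call the runtime's shutdown() method when they are finished.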

Step 2: Define Generation Functions

Write Python functions decorated with @sgl.function that define the generation program. Within these functions, use the SGLang primitives to build multi-turn conversations, branch into parallel generation paths, and apply constraints to outputs.

Key considerations:

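  • The @sgl.function decorator turns a Python function into an SGLang program; its first parameter (conventionally s) is the program state that accumulates the prompt and the generated outputs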
  • Append text with s += "text" to build the prompt
  • Use sgl.user() and sgl.assistant() for chat-formatted conversations
  • Use sgl.gen("name") to generate and capture output in a named variable
  • Use sgl.gen("name", choices=[...]) for constrained selection
  • Use sgl.gen("name", regex=pattern) for regex-constrained generation
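
The primitives listed above might be combined as follows. This is a sketch under stated assumptions: the program names, prompts, and the build_programs factory are illustrative, and the factory exists only to defer the sglang import; in a normal script the decorated functions would sit at module level.

```python
import re

# Pattern for a dotted IPv4 address, used to constrain one generation.
IPV4_PATTERN = r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"

# Compiling locally catches pattern syntax errors before any model call.
re.compile(IPV4_PATTERN)


def build_programs():
    """Return three example programs (hypothetical factory, see lead-in)."""
    import sglang as sgl

    @sgl.function
    def multi_turn_chat(s, question_1, question_2):
        # Each user/assistant pair is one turn; answers are captured by name.
        s += sgl.user(question_1)
        s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
        s += sgl.user(question_2)
        s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))

    @sgl.function
    def pick_tool(s, question):
        # choices=[...] restricts the output to one of the listed strings.
        s += "To answer this question: " + question + ", "
        s += "I need to use a " + sgl.gen("tool", choices=["calculator", "search engine"])

    @sgl.function
    def dns_lookup(s):
        # regex=... restricts the output to strings matching the pattern.
        s += "Q: What is the IP address of the Google DNS servers?\n"
        s += "A: " + sgl.gen("ip", temperature=0, regex=IPV4_PATTERN)

    return multi_turn_chat, pick_tool, dns_lookup
```

The constrained-decoding engine may support only a subset of Python's regex syntax, so the local re.compile call is a sanity check rather than a guarantee that the backend accepts the pattern.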

Step 3: Implement Branching Logic

For programs requiring parallel exploration, use the fork primitive to create multiple generation branches that share the same prefix. Each branch generates independently and results can be joined back into the main execution flow.

Key considerations:

  • s.fork(n) creates n parallel branches sharing the prefix cache
  • Each fork can generate with different continuations
  • Branches execute in parallel leveraging RadixAttention prefix sharing
  • Results are read by indexing a branch and then the variable name, for example forks[0]["name"]
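
A fork-based program along these lines could look like the sketch below. The prompt text, the branch_prompt helper, and the build_tip_program factory are illustrative assumptions; the factory only defers the sglang import so the snippet loads without the package.

```python
def branch_prompt(index):
    """Instruction appended to branch `index` (hypothetical helper)."""
    return f"Now, expand tip {index + 1} into a paragraph:\n"


def build_tip_program():
    """Return a program that expands two tips in parallel branches."""
    import sglang as sgl

    @sgl.function
    def tip_suggestion(s):
        # Shared prefix: both branches reuse it instead of recomputing it.
        s += (
            "Here are two tips for staying healthy: "
            "1. Balanced Diet. 2. Regular Exercise.\n\n"
        )

        # Two branches that start from the same prefix.
        forks = s.fork(2)
        for i, f in enumerate(forks):
            f += branch_prompt(i)
            # Each branch has its own state, so the name can be reused.
            f += sgl.gen("detailed_tip", max_tokens=256, stop="\n\n")

        # Reading a branch result pulls it back into the main flow.
        s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
        s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
        s += "In summary" + sgl.gen("summary")

    return tip_suggestion
```

Reusing the name "detailed_tip" in every branch keeps the loop body uniform; the branch index, not the variable name, distinguishes the results.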

Step 4: Execute the Generation Program

Run the defined function with input arguments using .run() for single execution, .run_batch() for batch execution, or with stream=True for streaming output. The runtime handles prompt caching, batching, and efficient execution.

Key considerations:

  • .run(arg1=val1, ...) for single execution returning a ProgramState
  • .run_batch([dict1, dict2, ...]) for batch execution, returning a list of states (one per input dict)
  • stream=True enables token-by-token streaming via text_iter()
  • progress_bar=True shows batch execution progress
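
The three execution modes above can be wrapped in thin helpers, sketched here. The helper names are hypothetical; program stands for any function decorated with @sgl.function, and a default backend is assumed to be set already.

```python
def run_once(program, **kwargs):
    """Single execution: returns one program state."""
    return program.run(**kwargs)


def run_many(program, batch_args):
    """Batch execution: `batch_args` is a list of keyword-argument dicts."""
    return program.run_batch(batch_args, progress_bar=True)


def stream_chunks(program, **kwargs):
    """Streaming execution: yields text chunks as they are produced."""
    state = program.run(stream=True, **kwargs)
    for chunk in state.text_iter():
        yield chunk
```

With a real program, a streaming consumer might be written as: for chunk in stream_chunks(multi_turn_chat, question_1="...", question_2="..."): print(chunk, end="", flush=True). Here multi_turn_chat and its argument names are placeholders for a program defined elsewhere.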

Step 5: Extract Results

Access generated outputs from the returned ProgramState object. Named generation variables are accessible by indexing the state, full text is available via state.text(), and chat messages via state.messages().

Key considerations:

  • state["variable_name"] retrieves a specific generated output
  • state.text() returns the full concatenated program text
  • state.messages() returns the conversation in message format
  • Multiple named generations can be accessed independently
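
The accessors above can be gathered into a single summary, as in this sketch. summarize_state is a hypothetical helper, and state is assumed to be the object returned by .run() after generation has finished.

```python
def summarize_state(state, names):
    """Collect named outputs, full text, and chat messages from a state."""
    return {
        # One entry per named sgl.gen(...) call the caller asks for.
        "outputs": {name: state[name] for name in names},
        # The whole program text: prompt pieces plus generated pieces.
        "text": state.text(),
        # The conversation in message format.
        "messages": state.messages(),
    }
```

For a two-turn chat program, summarize_state(state, ["answer_1", "answer_2"]) would return both answers alongside the transcript; the variable names must match those passed to sgl.gen in the program.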
