Implementation: SGLang CoT Decoding (Sgl_project_Sglang)
| Knowledge Sources | |
|---|---|
| Domains | Inference, Reasoning |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements Chain-of-Thought (CoT) decoding as described in arXiv:2402.10200, which elicits reasoning from LLMs by exploring alternative first tokens.
Description
cot_decoding.py uses SGLang's fork/join parallelism and log probability APIs to implement a research-level decoding algorithm. The cot_decoding function, decorated with @sgl.function, explores the top-k alternative tokens at the first decoding step using s.fork(). For each alternative starting token, it continues with greedy decoding (temperature=0) and calculates a "probability disparity" score -- the average difference between top-1 and top-2 token probabilities across all decoded positions.
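The disparity score described above can be sketched in plain Python. This is a minimal illustration, not the file's actual code: `top2_logprobs` is a hypothetical list of (top-1 logprob, top-2 logprob) pairs, one per decoded position, such as could be assembled from the metadata returned when `top_logprobs_num >= 2`.

```python
from math import exp

def disparity_score(top2_logprobs):
    """Average gap between top-1 and top-2 token probabilities.

    top2_logprobs: list of (logprob_top1, logprob_top2) pairs, one per
    decoded position (hypothetical shape for illustration; the real API
    returns richer per-token metadata). Higher scores indicate paths
    where the model decodes more confidently.
    """
    if not top2_logprobs:
        return 0.0
    # Convert log probabilities back to probabilities, then average the gaps.
    gaps = [exp(lp1) - exp(lp2) for lp1, lp2 in top2_logprobs]
    return sum(gaps) / len(gaps)
```

Each forked path gets one such score, computed over all of its greedily decoded positions.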
Higher disparity scores indicate paths where the model exhibits more confident reasoning. The algorithm then extracts answer spans from each path by appending "So the answer is" and generating further. This approach improves reasoning without any prompt engineering.
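Selecting a final answer then reduces to taking the path with the highest disparity score. A minimal sketch, assuming each explored path has been reduced to a hypothetical (score, answer_span) pair:

```python
def pick_best_answer(paths):
    """paths: list of (disparity_score, answer_span) pairs, one per
    explored first token (hypothetical structure for illustration).
    Returns the answer span from the most confident decoding path."""
    score, answer = max(paths, key=lambda p: p[0])
    return answer
```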
The implementation uses return_logprob=True and top_logprobs_num throughout to access token-level probability information, and provides verbose colored terminal output for debugging and analysis.
Usage
Use this example to implement CoT decoding for math and reasoning tasks where exploring alternative first tokens can lead to better chain-of-thought reasoning paths. Requires a running SGLang runtime endpoint.
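A server must be listening on the endpoint the script targets before it can run. A typical launch looks like the following; the model path is a placeholder, substitute your own checkpoint:

```shell
# Start an SGLang runtime on port 30000 (model path is an example)
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
```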
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: examples/frontend_language/usage/cot_decoding.py
- Lines: 1-115
Signature
@sgl.function
def cot_decoding(s, question, get_top_k, is_chat_model, verbose): ...
Import
from math import exp
from pprint import pformat
import sglang as sgl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| question | str | Yes | The question to answer using CoT decoding |
| get_top_k | int | Yes | Number of alternative first tokens to explore |
| is_chat_model | bool | Yes | Whether the model uses chat template format |
| verbose | bool | Yes | Whether to print detailed per-token probability information |
Outputs
| Name | Type | Description |
|---|---|---|
| Console output | str | Colored terminal output showing each path's first token, probability disparity score, and extracted answer |
| get_top_k | generation metadata | Top-k tokens and log probabilities from the first decoding step |
| answer | str | Generated continuation for each path via greedy decoding |
| answer_span | str | Extracted answer span from "So the answer is" prompting |
Usage Examples
import sglang as sgl
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = cot_decoding.run(
question="Claire makes a 3 egg omelet every morning for breakfast. "
"How many dozens of eggs will she eat in 4 weeks?",
get_top_k=10,
is_chat_model=True,
verbose=False,
)