Implementation: SGLang CoT Decoding (Sgl_project_Sglang)
| Knowledge Sources | |
|---|---|
| Domains | Inference, Reasoning |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements Chain-of-Thought (CoT) decoding as described in arXiv:2402.10200, which elicits reasoning from LLMs by exploring alternative first tokens.
Description
cot_decoding.py uses SGLang's fork/join parallelism and log probability APIs to implement a research-level decoding algorithm. The cot_decoding function, decorated with @sgl.function, explores the top-k alternative tokens at the first decoding step using s.fork(). For each alternative starting token, it continues with greedy decoding (temperature=0) and calculates a "probability disparity" score -- the average difference between top-1 and top-2 token probabilities across all decoded positions.
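The disparity score described above can be sketched in plain Python. This is a minimal illustration, not the file's actual code: `top2_logprobs` is a hypothetical list of (top-1 logprob, top-2 logprob) pairs, one per decoded position, such as could be assembled from the metadata returned when `top_logprobs_num >= 2`.

```python
from math import exp

def disparity_score(top2_logprobs):
    """Average gap between top-1 and top-2 token probabilities.

    top2_logprobs: list of (logprob_top1, logprob_top2) pairs, one per
    decoded position (hypothetical shape for illustration; the real API
    returns richer per-token metadata). Higher scores indicate paths
    where the model decodes more confidently.
    """
    if not top2_logprobs:
        return 0.0
    # Convert log probabilities back to probabilities, then average the gaps.
    gaps = [exp(lp1) - exp(lp2) for lp1, lp2 in top2_logprobs]
    return sum(gaps) / len(gaps)
```

Each forked path gets one such score, computed over all of its greedily decoded positions.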
Higher disparity scores indicate paths where the model exhibits more confident reasoning. The algorithm then extracts answer spans from each path by appending "So the answer is" and generating further. This approach improves reasoning without any prompt engineering.
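Selecting a final answer then reduces to taking the path with the highest disparity score. A minimal sketch, assuming each explored path has been reduced to a hypothetical (score, answer_span) pair:

```python
def pick_best_answer(paths):
    """paths: list of (disparity_score, answer_span) pairs, one per
    explored first token (hypothetical structure for illustration).
    Returns the answer span from the most confident decoding path."""
    score, answer = max(paths, key=lambda p: p[0])
    return answer
```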
The implementation uses return_logprob=True and top_logprobs_num throughout to access token-level probability information, and provides verbose colored terminal output for debugging and analysis.
Usage
Use this example to implement CoT decoding for math and reasoning tasks where exploring alternative first tokens can lead to better chain-of-thought reasoning paths. Requires a running SGLang runtime endpoint.
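A server must be listening on the endpoint the script targets before it can run. A typical launch looks like the following; the model path is a placeholder, substitute your own checkpoint:

```shell
# Start an SGLang runtime on port 30000 (model path is an example)
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
```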
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: examples/frontend_language/usage/cot_decoding.py
- Lines: 1-115
Signature
@sgl.function
def cot_decoding(s, question, get_top_k, is_chat_model, verbose): ...
Import
from math import exp
from pprint import pformat
import sglang as sgl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| question | str | Yes | The question to answer using CoT decoding |
| get_top_k | int | Yes | Number of alternative first tokens to explore |
| is_chat_model | bool | Yes | Whether the model uses chat template format |
| verbose | bool | Yes | Whether to print detailed per-token probability information |
Outputs
| Name | Type | Description |
|---|---|---|
| Console output | str | Colored terminal output showing each path's first token, probability disparity score, and extracted answer |
| get_top_k | generation metadata | Top-k tokens and log probabilities from the first decoding step |
| answer | str | Generated continuation for each path via greedy decoding |
| answer_span | str | Extracted answer span from "So the answer is" prompting |
Usage Examples
import sglang as sgl
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = cot_decoding.run(
question="Claire makes a 3 egg omelet every morning for breakfast. "
"How many dozens of eggs will she eat in 4 weeks?",
get_top_k=10,
is_chat_model=True,
verbose=False,
)