

Implementation:Sgl project Sglang CoT Decoding

From Leeroopedia


Knowledge Sources
Domains Inference, Reasoning
Last Updated 2026-02-10 00:00 GMT

Overview

Implements Chain-of-Thought (CoT) decoding as described in arXiv:2402.10200 ("Chain-of-Thought Reasoning Without Prompting"), which elicits reasoning from LLMs by exploring alternative first tokens rather than committing to the single greedy one.

Description

cot_decoding.py uses SGLang's fork/join parallelism and log-probability APIs to implement a research-level decoding algorithm. The cot_decoding function, decorated with @sgl.function, explores the top-k alternative tokens at the first decoding step using s.fork(). Each forked path begins with a different candidate first token, continues with greedy decoding (temperature=0), and receives a "probability disparity" score: the average difference between the top-1 and top-2 token probabilities across all decoded positions.

Higher disparity scores indicate paths where the model decodes more confidently, which the paper finds correlates with the presence of a chain of thought. The algorithm then extracts an answer span from each path by appending "So the answer is" and generating further. This approach improves reasoning without any prompt engineering (no few-shot exemplars or explicit "think step by step" instructions).
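The scoring rule can be sketched in plain Python. This is an illustration of the disparity metric only, not the repository's code; the (top-1, top-2) log-probability pair format and the path names are assumptions for the example.

```python
from math import exp

def disparity_score(top2_logprobs):
    """Average (p_top1 - p_top2) over decoded positions.

    top2_logprobs: list of (logprob_top1, logprob_top2) pairs,
    one pair per decoded token. Higher scores indicate paths
    where the model was consistently more confident.
    """
    if not top2_logprobs:
        return 0.0
    return sum(exp(a) - exp(b) for a, b in top2_logprobs) / len(top2_logprobs)

# Pick the most confident of several decoded paths (toy numbers).
paths = {
    "path A": [(-0.1, -3.0), (-0.2, -2.5)],  # large top-1/top-2 gaps
    "path B": [(-0.9, -1.0), (-0.8, -1.1)],  # nearly-tied candidates
}
best = max(paths, key=lambda name: disparity_score(paths[name]))
# best is "path A": its per-token probability gaps are much larger.
```

The selected path is the one the algorithm treats as containing the model's most confident reasoning.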

The implementation uses return_logprob=True and top_logprobs_num throughout to access token-level probability information, and provides verbose colored terminal output for debugging and analysis.
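To make the per-token data concrete, the sketch below walks a hypothetical top-logprobs structure of the kind return_logprob=True exposes: for each decoded position, a ranked list of (logprob, token_id, token_text) entries. The exact field names and tuple layout in SGLang's meta info may differ; the values here are invented for illustration.

```python
from math import exp

# Assumed shape: one inner list per decoded position, entries sorted
# from most to least likely. Token ids and texts are made up.
output_top_logprobs = [
    [(-0.05, 791, "The"), (-3.2, 32, "A")],
    [(-0.4, 4320, " answer"), (-1.5, 1121, " result")],
]

for position in output_top_logprobs:
    (lp1, _, tok1), (lp2, _, tok2) = position[:2]
    gap = exp(lp1) - exp(lp2)
    print(f"{tok1!r} vs {tok2!r}: p1={exp(lp1):.3f} p2={exp(lp2):.3f} gap={gap:.3f}")
```

Averaging the printed gaps over all positions yields the path's disparity score.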

Usage

Use this example to implement CoT decoding for math and reasoning tasks where exploring alternative first tokens can lead to better chain-of-thought reasoning paths. Requires a running SGLang runtime endpoint.
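A runtime endpoint can be started with SGLang's server launcher; the model path below is a placeholder, so substitute any instruction-tuned model you have access to.

```shell
# Launch an SGLang runtime on the port the example connects to.
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000
```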

Code Reference

Source Location

Signature

@sgl.function
def cot_decoding(s, question, get_top_k, is_chat_model, verbose): ...

Import

from math import exp
from pprint import pformat

import sglang as sgl

I/O Contract

Inputs

Name Type Required Description
question str Yes The question to answer using CoT decoding
get_top_k int Yes Number of alternative first tokens to explore
is_chat_model bool Yes Whether the model uses chat template format
verbose bool Yes Whether to print detailed per-token probability information

Outputs

Name Type Description
Console output str Colored terminal output showing each path's first token, probability disparity score, and extracted answer
get_top_k generation metadata Top-k tokens and log probabilities from the first decoding step
answer str Generated continuation for each path via greedy decoding
answer_span str Extracted answer span from "So the answer is" prompting

Usage Examples

import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = cot_decoding.run(
    question="Claire makes a 3 egg omelet every morning for breakfast. "
             "How many dozens of eggs will she eat in 4 weeks?",
    get_top_k=10,
    is_chat_model=True,
    verbose=False,
)
