Heuristic:Sgl project Sglang Schedule Conservativeness Tuning

From Leeroopedia




Knowledge Sources
Domains Optimization, Scheduling
Last Updated 2026-02-10 00:00 GMT

Overview

Tune `--schedule-conservativeness` to balance throughput (admitting more requests) against the cost of KV cache retractions when the cache pool fills up.

Description

SGLang's scheduler uses a ratio called `new_token_ratio` to estimate how much KV cache space the currently running requests will still need. This ratio controls how aggressively the scheduler admits new prefill requests. The `--schedule-conservativeness` parameter is a multiplier on the initial value of that ratio: higher values make the scheduler more cautious (it admits fewer requests), and lower values make it more aggressive (it admits more requests). When the scheduler over-commits and the KV cache fills up, it must retract running decode requests, which is expensive and wastes computation.

Usage

Use this heuristic when you observe either: (1) low token usage despite queued requests (scheduler too conservative, decrease the value), or (2) frequent retraction warnings in logs (scheduler too aggressive, increase the value). Acceptable retraction frequency is approximately one per minute.

The Insight (Rule of Thumb)

  • Action: Adjust `--schedule-conservativeness` based on log metrics.
  • Default Value: 1.0 (maps to `init_new_token_ratio = 0.7`)
  • If `token_usage` < 0.9 and `#queue-req` > 0: Decrease to ~0.3 (more aggressive)
  • If seeing frequent retractions: Increase to ~1.3 (more conservative)
  • Trade-off: Lower = higher throughput but risk of retractions; Higher = stable but underutilized

Healthy metrics to watch:

  • `#queue-req`: Healthy range is 100-2000
  • `token_usage`: Target > 0.9 for good utilization
  • Retraction warnings: Acceptable ~1 per minute
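
The rule of thumb above can be sketched as a small decision helper. This is a minimal illustration, not part of SGLang: `SchedulerSnapshot` and `suggest_conservativeness` are hypothetical names, the inputs are assumed to have been collected from the server logs by some other means, and checking retractions before utilization is an assumption made here rather than something the source specifies.

```python
from dataclasses import dataclass


@dataclass
class SchedulerSnapshot:
    """Hypothetical summary of recent scheduler log metrics (not an SGLang type)."""
    token_usage: float             # fraction of the KV cache pool in use, 0.0 to 1.0
    queue_req: int                 # value reported as #queue-req
    retractions_per_minute: float  # observed rate of retraction warnings


def suggest_conservativeness(snapshot: SchedulerSnapshot, current: float = 1.0) -> float:
    """Suggest a --schedule-conservativeness value from the rule of thumb."""
    # Assumption: retractions are checked first because they waste finished work.
    if snapshot.retractions_per_minute > 1.0:
        return max(current, 1.3)
    # Cache underused while requests are waiting: admit more aggressively.
    if snapshot.token_usage < 0.9 and snapshot.queue_req > 0:
        return min(current, 0.3)
    # Otherwise the current setting looks acceptable.
    return current


print(suggest_conservativeness(SchedulerSnapshot(0.6, 500, 0.0)))
```

The suggested values (0.3 and 1.3) are the starting points named in the list above; in practice they would be adjusted further after watching the logs again.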

Reasoning

The scheduler's admission control works through a decaying ratio:

init_new_token_ratio = min(0.7 * schedule_conservativeness, 1.0)
min_new_token_ratio = min(init_new_token_ratio * 0.14, 1.0)
# Decays linearly over 600 steps from init to min
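
To make the numbers concrete, the following sketch evaluates the two formulas above for a few settings and models the linear decay. It is a simplified illustration under stated assumptions: `ratio_bounds` and `ratio_at_step` are hypothetical helper names, the constants are the defaults quoted in this page, and the sketch ignores the reset that happens when a retraction occurs.

```python
INIT_NEW_TOKEN_RATIO = 0.7         # default of SGLANG_INIT_NEW_TOKEN_RATIO
MIN_NEW_TOKEN_RATIO_FACTOR = 0.14  # default of SGLANG_MIN_NEW_TOKEN_RATIO_FACTOR
DECAY_STEPS = 600                  # default of SGLANG_NEW_TOKEN_RATIO_DECAY_STEPS


def ratio_bounds(schedule_conservativeness: float) -> tuple[float, float]:
    """Return (init_new_token_ratio, min_new_token_ratio) for a given setting."""
    init_ratio = min(INIT_NEW_TOKEN_RATIO * schedule_conservativeness, 1.0)
    min_ratio = min(init_ratio * MIN_NEW_TOKEN_RATIO_FACTOR, 1.0)
    return init_ratio, min_ratio


def ratio_at_step(schedule_conservativeness: float, step: int) -> float:
    """Linear decay from init to min over DECAY_STEPS, then held at min."""
    init_ratio, min_ratio = ratio_bounds(schedule_conservativeness)
    per_step = (init_ratio - min_ratio) / DECAY_STEPS
    return max(init_ratio - per_step * step, min_ratio)


for c in (0.3, 1.0, 1.3):
    init_ratio, min_ratio = ratio_bounds(c)
    print(f"conservativeness={c}: init={init_ratio:.3f}, "
          f"min={min_ratio:.3f}, step 300={ratio_at_step(c, 300):.3f}")
```

Because of the `min(..., 1.0)` clamp, any setting above roughly 1.43 (that is, 1.0 / 0.7) yields the same initial ratio of 1.0, so raising the value further has no additional effect on the starting point.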

From `python/sglang/srt/environ.py:216-221`:

SGLANG_INIT_NEW_TOKEN_RATIO = EnvFloat(0.7)
SGLANG_MIN_NEW_TOKEN_RATIO_FACTOR = EnvFloat(0.14)
SGLANG_NEW_TOKEN_RATIO_DECAY_STEPS = EnvInt(600)
SGLANG_RETRACT_DECODE_STEPS = EnvInt(20)
SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION = EnvInt(4096)

When a retraction occurs, the ratio is bumped back up and the scheduler becomes more cautious. The retraction warning from `python/sglang/srt/managers/scheduler.py`:

"KV cache pool is full. Retract requests. #retracted_reqs: {}, #new_token_ratio: {:.4f} -> {:.4f}"
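
One way to track how often this warning appears is to match it with a regular expression built from the format string above. The sketch below is illustrative only: `parse_retractions` is a hypothetical helper, the sample lines are made up, and nothing is assumed about the timestamp or logger prefix of real log lines, so the pattern searches anywhere in each line.

```python
import re

# Pattern derived from the quoted format string; any log prefix is ignored.
RETRACT_RE = re.compile(
    r"KV cache pool is full\. Retract requests\. "
    r"#retracted_reqs: (\d+), "
    r"#new_token_ratio: (\d+\.\d+) -> (\d+\.\d+)"
)


def parse_retractions(lines):
    """Return (retracted_reqs, old_ratio, new_ratio) for each retraction warning."""
    events = []
    for line in lines:
        match = RETRACT_RE.search(line)
        if match:
            events.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return events


# Made-up sample lines for illustration, not real server output.
sample = [
    "some unrelated log line",
    "KV cache pool is full. Retract requests. "
    "#retracted_reqs: 3, #new_token_ratio: 0.2105 -> 0.6400",
]
print(parse_retractions(sample))
```

Turning the event count into a per-minute rate requires the timestamps of the log lines, whose format depends on the logging configuration and is left out of this sketch.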

The linear decay over 600 steps means the scheduler gradually becomes more aggressive: as long as no retraction occurs, the ratio falls from its initial value toward its minimum (from 0.7 to 0.098 at the default setting), and the scheduler admits more requests. The 0.14 factor for the minimum ratio was empirically tuned to prevent over-commitment.
