
Heuristic: SGLang Chunked Prefill OOM Prevention

From Leeroopedia




Knowledge Sources
Domains Optimization, Memory_Management
Last Updated 2026-02-10 00:00 GMT

Overview

Reduce `--chunked-prefill-size` to 4096 or 2048 to prevent OOM errors during prefill of long prompts, trading prefill speed for memory stability.

Description

Chunked prefill splits long input sequences into smaller chunks processed sequentially instead of all at once. This bounds the peak memory usage during the prefill phase, which is the most memory-intensive part of inference (attention over long input). The default chunked prefill size is typically 8192 tokens, but for models with large hidden dimensions or limited GPU VRAM, this can cause OOM. Reducing the chunk size caps peak activation memory at the cost of increased prefill latency for long prompts.
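In pseudocode terms, the chunking loop can be sketched as follows. This is an illustrative sketch, not SGLang's actual implementation; `prefill_step` is a hypothetical stand-in for the engine's per-chunk forward pass:

```python
def chunked_prefill(prompt_tokens, chunk_size, prefill_step):
    """Run prefill over a long prompt in fixed-size chunks.

    Peak activation memory is bounded by chunk_size instead of
    len(prompt_tokens); the KV cache still grows with the full prompt.
    """
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        # Attention for this chunk attends to all previously cached KV.
        prefill_step(chunk)
```

Each chunk reads the KV cache written by earlier chunks, so the result matches a single full-sequence prefill; only the peak activation footprint changes.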

Usage

Use this heuristic when experiencing OOM errors specifically during prefill (not decode). Signs: OOM occurs when processing the first few requests with long prompts, but decode runs fine. Also useful for multimodal models where image/video tokens inflate the effective sequence length.

The Insight (Rule of Thumb)

  • Action: Set `--chunked-prefill-size 4096` or `--chunked-prefill-size 2048` if OOM occurs during prefill
  • Value: Default is 8192 (or model-specific). Reduce by halving until stable.
  • Trade-off: Smaller chunks = less peak memory but slower prefill for long sequences (20-30% slower for 2048 vs 8192)
  • Disable: Set `--chunked-prefill-size -1` to disable chunking entirely (processes full sequence at once, fastest but highest memory)
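The "reduce by halving until stable" rule can be expressed as a small search loop. A minimal sketch, assuming a hypothetical `try_launch` callable that raises `MemoryError` when the server OOMs at a given chunk size (in practice you would relaunch the server manually with each candidate value):

```python
def find_stable_chunk_size(try_launch, start=8192, floor=512):
    """Halve --chunked-prefill-size until the server starts without OOM."""
    size = start
    while size >= floor:
        try:
            try_launch(size)   # e.g. launch with --chunked-prefill-size {size}
            return size
        except MemoryError:
            size //= 2         # trade prefill speed for memory stability
    return None                # even the floor OOMs; look at --mem-fraction-static
```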

For OOM during decode (not prefill):

  • Lower `--max-running-requests` instead
  • Or reduce `--mem-fraction-static`

NPU-specific defaults by memory:

  • 32GB NPU (Ascend 910B4): `--chunked-prefill-size 4096`
  • 64GB NPU (Ascend 910B1/B2/B3): `--chunked-prefill-size 8192`
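These defaults can be captured in a small lookup helper. The function name is hypothetical (not part of SGLang); the values come from the table above:

```python
def npu_default_chunked_prefill_size(npu_mem_gb: int) -> int:
    """Default --chunked-prefill-size by Ascend NPU memory, per the table above."""
    if npu_mem_gb >= 64:       # Ascend 910B1/B2/B3
        return 8192
    if npu_mem_gb >= 32:       # Ascend 910B4
        return 4096
    # Below 32 GB is not covered by the table; halving again is a guess.
    return 2048
```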

CUDA graph interaction:

  • Increasing `--cuda-graph-max-bs` consumes more memory, may require reducing chunked prefill size or `--mem-fraction-static`
  • CUDA graphs are enabled by default for batch sizes below the configured maximum (160 or 256, depending on the setup)

Multimodal models:

  • Some vision-language models have issues with chunked prefill; disable it with `--chunked-prefill-size -1` if you see errors
  • Alternatively, increase the chunk size so that long image/video token sequences fit within a single chunk

Reasoning

During prefill, the attention mechanism computes over all input tokens simultaneously within each chunk. The activation memory scales as O(batch_size * chunk_size * hidden_dim) for the attention layer. For a 70B model with 8K hidden dim, a chunk of 8192 tokens requires significant temporary memory for Q, K, V projections and attention scores.
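A back-of-the-envelope estimate of the Q/K/V projection activations makes the scaling concrete. This counts only the three projection outputs in fp16 and ignores attention scores and MLP buffers, so the absolute numbers are illustrative:

```python
def qkv_activation_bytes(batch_size, chunk_size, hidden_dim, dtype_bytes=2):
    """Rough peak bytes for the Q, K, V activations of one attention layer."""
    return 3 * batch_size * chunk_size * hidden_dim * dtype_bytes

# 70B-class model with hidden_dim = 8192, fp16:
full = qkv_activation_bytes(1, 8192, 8192)   # chunk = 8192 -> 384 MiB
half = qkv_activation_bytes(1, 4096, 8192)   # chunk = 4096 -> 192 MiB
```

Halving the chunk size halves this term, which is why shrinking `--chunked-prefill-size` directly caps peak prefill memory.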

The sliding window interaction formula from `python/sglang/srt/utils/common.py`:

extend_input_len_swa_limit = page_size + 2 * max(sliding_window_size, chunked_prefill_size)

The 2x factor accounts for the fact that KV cache entries from previous chunks cannot be freed until the next chunk boundary.
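As a sketch, the limit above computes as follows (names follow the formula; the exact call site in `common.py` may differ):

```python
def extend_input_len_swa_limit(page_size, sliding_window_size, chunked_prefill_size):
    # The 2x factor: KV entries from the previous chunk stay pinned until the
    # next chunk boundary, so two chunks' (or windows') worth must be budgeted.
    return page_size + 2 * max(sliding_window_size, chunked_prefill_size)

# e.g. page_size=1, sliding window 4096, chunk 8192 -> 1 + 2 * 8192 = 16385
```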

Dynamic chunking for pipeline parallelism from `python/sglang/srt/managers/scheduler_pp_mixin.py`:

target = chunked_prefill_size * 1.25 - i * (chunked_prefill_size * 1.25 / 128)

The `SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR = 0.75` environment variable controls the EMA smoothing for dynamic chunk size adjustment across pipeline stages.
