
Heuristic: SGLang Memory Fraction Tuning

From Leeroopedia




Knowledge Sources
  • Domains: Optimization, Memory_Management
  • Last Updated: 2026-02-10 00:00 GMT

Overview

Tune the `--mem-fraction-static` parameter to balance KV cache pool size against activation memory and CUDA graph buffers, targeting 5-8 GB of free GPU memory after model loading.

Description

The `--mem-fraction-static` parameter controls what fraction of GPU memory is statically allocated for the model weights and KV cache pool. SGLang uses a simple heuristic to set a default value, but optimal tuning depends on the specific model, GPU, and workload. The formula is: `mem_fraction_static = (model_weights + kv_cache_pool) / gpu_memory_capacity`. The remaining memory is used for activations, CUDA graph capture buffers, and PyTorch overhead. Too high a value causes OOM during inference; too low wastes GPU memory that could hold more concurrent requests.
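The formula can be sketched directly. This is a minimal illustration; the GiB sizes below are hypothetical assumptions, not measured values for any particular model.

```python
# A minimal sketch of the formula above. Sizes in GiB are illustrative
# assumptions, not measurements from a real deployment.
def mem_fraction_static(model_weights_gib: float,
                        kv_cache_pool_gib: float,
                        gpu_capacity_gib: float) -> float:
    """mem_fraction_static = (model_weights + kv_cache_pool) / gpu_memory_capacity"""
    return (model_weights_gib + kv_cache_pool_gib) / gpu_capacity_gib

# E.g. ~16 GiB of weights plus a 56 GiB KV cache pool on an 80 GiB GPU:
# 72 / 80 = 0.9, leaving 8 GiB free -- the top of the 5-8 GB target.
print(mem_fraction_static(16, 56, 80))  # 0.9
```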

Usage

Use this heuristic when you observe OOM errors during inference or when token usage is low despite queued requests. It is the most impactful single parameter for balancing throughput and stability.

The Insight (Rule of Thumb)

  • Action: After loading the model, check available GPU memory. Adjust `--mem-fraction-static` so that 5-8 GB remains free.
  • Value: Typical range is 0.6 to 0.9 depending on model size and GPU.
  • Trade-off: Higher value = more KV cache slots = higher throughput, but risk of OOM. Lower value = safer but reduced concurrency.
  • Iterative Approach: Increase in 0.01 increments until OOM occurs, then back off by 0.02.
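The iterative approach can be simulated as a simple loop. The `causes_oom` predicate is a hypothetical stand-in for an actual trial run of the server at a given fraction; in practice each probe means restarting SGLang and replaying a representative workload.

```python
# Hypothetical simulation of the iterative approach: step the fraction
# up by 0.01 until a (simulated) OOM, then back off by 0.02.
def tune_fraction(start: float, causes_oom) -> float:
    frac = start
    while not causes_oom(frac):          # trial run at this fraction
        frac = round(frac + 0.01, 2)     # no OOM: try a higher value
    return round(frac - 0.02, 2)         # OOM hit: back off by 0.02

# Suppose (hypothetically) the server first OOMs above 0.88:
print(tune_fraction(0.85, lambda f: f > 0.88))  # 0.87
```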

Decision Framework:

  • If `available_gpu_mem` is 5-8 GB after loading: Setting is optimal
  • If `available_gpu_mem` > 10 GB: Increase `--mem-fraction-static` to utilize wasted memory
  • If `available_gpu_mem` < 5 GB: Risk of OOM; decrease `--mem-fraction-static`
  • If seeing `CUDA out of memory`: Decrease to 0.8 or 0.7
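The decision framework above is mechanical enough to transcribe directly. The sketch below is a literal encoding of the four rules, with thresholds in GiB; values between 8 and 10 GiB are not covered by the rules, so they fall through to "optimal" here.

```python
# Direct transcription of the decision framework; thresholds in GiB.
def advise(available_gib: float, saw_cuda_oom: bool = False) -> str:
    if saw_cuda_oom:
        return "decrease --mem-fraction-static (try 0.8, then 0.7)"
    if available_gib > 10:
        return "increase --mem-fraction-static to use wasted memory"
    if available_gib < 5:
        return "decrease --mem-fraction-static (OOM risk)"
    return "optimal"  # 5-8 GiB free (8-10 GiB is unspecified above)

print(advise(6.5))                    # optimal
print(advise(12))                     # increase ...
print(advise(6.5, saw_cuda_oom=True)) # decrease ...
```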

Reasoning

GPU memory is divided into three pools at runtime:

  1. Model weights (fixed, depends on model size and quantization)
  2. KV cache pool (controlled by `--mem-fraction-static`)
  3. Activations + CUDA graphs + overhead (the remaining 5-8 GB)

The third pool varies with batch size, sequence length, and whether CUDA graphs are enabled. CUDA graph capture can consume several GB for large batch sizes. Setting `--mem-fraction-static` too aggressively starves the activation pool, causing sporadic OOM errors under peak load. The 5-8 GB target was empirically determined across A100/H100 deployments.
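The three-pool split can be made concrete with arithmetic. All numbers below are illustrative assumptions (an 80 GiB accelerator, ~16 GiB of weights), not figures from the deployments cited later.

```python
# Worked example of the three runtime pools; all numbers illustrative.
capacity = 80.0                    # GiB of GPU memory (assumed)
frac = 0.9                         # --mem-fraction-static
weights = 16.0                     # GiB, fixed by model size + quantization
static_pool = capacity * frac      # pool 1 + pool 2: weights + KV cache
kv_cache = static_pool - weights   # pool 2: KV cache slots
free = capacity - static_pool      # pool 3: activations, CUDA graphs, overhead
print(kv_cache, free)              # 56.0 8.0 -> within the 5-8 GB target
```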

Concrete examples from Ascend NPU best practices:

  • Qwen3-32B (MoE) on 4x NPU: `--mem-fraction-static 0.86`
  • DeepSeek-R1 (671B) on 64x NPU: `--mem-fraction-static 0.71`
  • Llama-3.1-8B on 1x NPU: `--mem-fraction-static 0.81`
