Heuristic: SGLang Memory Fraction Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Tune the `--mem-fraction-static` parameter to balance the KV cache pool size against activation memory and CUDA graph buffers, targeting 5-8 GB of free GPU memory after model loading.
Description
The `--mem-fraction-static` parameter controls what fraction of GPU memory is statically allocated for the model weights and KV cache pool. SGLang uses a simple heuristic to set a default value, but optimal tuning depends on the specific model, GPU, and workload. The formula is: `mem_fraction_static = (model_weights + kv_cache_pool) / gpu_memory_capacity`. The remaining memory is used for activations, CUDA graph capture buffers, and PyTorch overhead. Too high a value causes OOM during inference; too low wastes GPU memory that could hold more concurrent requests.
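The formula can be sketched numerically. The 80 GB capacity, 16 GB of weights, and 6 GB reserve below are illustrative assumptions, not SGLang defaults:

```python
def mem_fraction_static(gpu_gb: float, weights_gb: float, reserve_gb: float) -> float:
    """Fraction that covers weights + KV cache pool while leaving
    `reserve_gb` free for activations, CUDA graph buffers, and overhead."""
    kv_pool_gb = gpu_gb - weights_gb - reserve_gb
    return (weights_gb + kv_pool_gb) / gpu_gb

# Hypothetical: an 8B fp16 model (~16 GB of weights) on an 80 GB GPU,
# targeting 6 GB of free memory after loading.
print(round(mem_fraction_static(80.0, 16.0, 6.0), 3))  # 0.925
```

Note that the weights term cancels: the fraction reduces to `(gpu_gb - reserve_gb) / gpu_gb`, which is why the same 5-8 GB free-memory rule transfers across model sizes on a given GPU.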
Usage
Use this heuristic when you observe OOM errors during inference or when token usage is low despite queued requests. It is the most impactful single parameter for balancing throughput and stability.
The Insight (Rule of Thumb)
- Action: After loading the model, check available GPU memory. Adjust `--mem-fraction-static` so that 5-8 GB remains free.
- Value: Typical range is 0.6 to 0.9 depending on model size and GPU.
- Trade-off: Higher value = more KV cache slots = higher throughput, but risk of OOM. Lower value = safer but reduced concurrency.
- Iterative Approach: Increase by 0.01 increments until OOM occurs, then back off by 0.02.
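The iterative approach above can be sketched as a loop. Here `launch_ok` is a hypothetical callback that restarts the server at a given fraction and reports whether it survived without a CUDA OOM; the start and cap values are illustrative:

```python
def tune(launch_ok, start: float = 0.85, step: float = 0.01,
         backoff: float = 0.02, cap: float = 0.95) -> float:
    """Raise the fraction in `step` increments until an OOM is observed,
    then back off by `backoff` from the failing value."""
    frac = start
    while frac < cap:
        if launch_ok(frac + step):
            frac += step
        else:
            return round(frac + step - backoff, 2)  # OOM hit: step back
    return frac

# Simulated GPU that OOMs above 0.88: tuning settles at 0.87.
print(tune(lambda f: f <= 0.88))  # 0.87
```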
Decision Framework:
- If `available_gpu_mem` is 5-8 GB after loading: Setting is optimal
- If `available_gpu_mem` > 10 GB: Increase `--mem-fraction-static` to utilize wasted memory
- If `available_gpu_mem` < 5 GB: Risk of OOM; decrease `--mem-fraction-static`
- If seeing `CUDA out of memory`: Decrease to 0.8 or 0.7
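The decision framework maps directly onto a small helper. The thresholds and step sizes follow the rules above; this is a sketch, not anything SGLang ships:

```python
def adjust(free_gb: float, frac: float, saw_oom: bool = False) -> float:
    """Map observed free GPU memory (after model load) to an adjusted
    --mem-fraction-static value, per the decision framework."""
    if saw_oom:                      # CUDA out of memory: drop to 0.8 or 0.7
        return 0.8 if frac > 0.8 else 0.7
    if free_gb > 10.0:               # wasted memory: grow in small steps
        return round(frac + 0.01, 2)
    if free_gb < 5.0:                # OOM risk: back off
        return round(frac - 0.02, 2)
    return frac                      # 5-8 GB free: setting is optimal

print(adjust(6.5, 0.85))                # 0.85 (optimal, no change)
print(adjust(12.0, 0.85))               # 0.86
print(adjust(4.0, 0.85))                # 0.83
print(adjust(6.0, 0.92, saw_oom=True))  # 0.8
```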
Reasoning
GPU memory is divided into three pools at runtime:
- Model weights (fixed, depends on model size and quantization)
- KV cache pool (controlled by `--mem-fraction-static`)
- Activations + CUDA graphs + overhead (the remaining 5-8 GB)
The third pool varies with batch size, sequence length, and whether CUDA graphs are enabled. CUDA graph capture can consume several GB for large batch sizes. Setting `--mem-fraction-static` too aggressively starves the activation pool, causing sporadic OOM errors under peak load. The 5-8 GB target was empirically determined across A100/H100 deployments.
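The three-pool split can be made concrete with a small calculation; the 80 GB capacity and 16 GB of weights below are illustrative assumptions:

```python
def pools(gpu_gb: float, weights_gb: float, frac: float):
    """Split GPU memory into the three runtime pools."""
    static_gb = frac * gpu_gb            # model weights + KV cache pool
    kv_pool_gb = static_gb - weights_gb  # memory available for KV cache slots
    remaining_gb = gpu_gb - static_gb    # activations + CUDA graphs + overhead
    return kv_pool_gb, remaining_gb

# Hypothetical 16 GB model on an 80 GB GPU at frac=0.9:
kv, free = pools(80.0, 16.0, 0.9)
print(kv, free)  # 56.0 8.0 -> the 8 GB remainder lands in the 5-8 GB target
```

In practice, the free figure can be read after the server loads via `nvidia-smi` or `torch.cuda.mem_get_info()`.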
Concrete examples from Ascend NPU best practices:
- Qwen3-32B on 4x NPU: `--mem-fraction-static 0.86`
- DeepSeek-R1 (671B) on 64x NPU: `--mem-fraction-static 0.71`
- Llama-3.1-8B on 1x NPU: `--mem-fraction-static 0.81`
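For reference, a launch command in the same spirit; the model path and port are illustrative placeholders, and the fraction mirrors the Llama-3.1-8B entry above:

```shell
# Launch the SGLang server with an explicit static memory fraction.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.81 \
  --port 30000
```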