Heuristic: InternLM LMDeploy OOM Troubleshooting
| Knowledge Sources | |
|---|---|
| Domains | Debugging, Memory_Management |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Systematic troubleshooting guide for CUDA out-of-memory errors during LMDeploy inference, covering KV cache reduction, quantization, and tensor parallelism strategies.
Description
CUDA OOM is the most common error when deploying LLMs with LMDeploy. Memory is consumed by three main components: model weights, KV cache, and activation memory. Since model weights are fixed by the model architecture and precision, the primary tuning lever is the KV cache allocation. This heuristic provides a systematic approach to resolving OOM errors, from simple parameter adjustments to architectural changes.
Usage
Use this heuristic when you encounter:
- `RuntimeError: [TM][ERROR] CUDA runtime error: out of memory`
- `torch.cuda.OutOfMemoryError`
- The inference engine fails to initialize due to insufficient memory.
- Requests are rejected or slow due to memory pressure during serving.
The Insight (Rule of Thumb)
Step 1 - Reduce KV cache allocation (quickest fix):
- Action: Set `cache_max_entry_count=0.2` (down from default 0.8).
- Trade-off: Fewer concurrent sequences, but inference will start.
Step 2 - Use quantization:
- Action: Apply W4A16 (AWQ) quantization to reduce model weight memory by ~4x.
- Trade-off: Minor accuracy loss; significant memory savings.
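A sketch of the offline quantization step using LMDeploy's `lite auto_awq` command; the model ID and output directory are illustrative placeholders:

```shell
# One-time offline step: quantize weights to W4A16 with AWQ.
# Model ID and --work-dir are examples, not prescribed values.
lmdeploy lite auto_awq internlm/internlm2-chat-7b \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir ./internlm2-chat-7b-4bit
```

The quantized checkpoint in `--work-dir` can then be served like any other model; pass `model_format='awq'` in `TurbomindEngineConfig` when loading it.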
Step 3 - Enable KV cache quantization:
- Action: Set `quant_policy=4` (INT4) or `quant_policy=8` (INT8) to compress KV cache.
- Trade-off: INT8 preserves accuracy; INT4 has slight accuracy loss but 4x cache capacity.
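A minimal sketch of enabling KV cache quantization through `TurbomindEngineConfig` (the model path is a placeholder; combining `quant_policy` with a reduced `cache_max_entry_count` is a judgment call, not a requirement):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 stores the KV cache in INT8; use 4 for INT4
# (more capacity, slight accuracy loss).
backend_config = TurbomindEngineConfig(
    cache_max_entry_count=0.4,  # optionally lowered alongside quantization
    quant_policy=8,
)
pipe = pipeline('model_path', backend_config=backend_config)
```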
Step 4 - Increase tensor parallelism:
- Action: Set `tp=2` or `tp=4` to shard the model across multiple GPUs.
- Trade-off: Requires multiple GPUs; adds inter-GPU communication overhead.
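A sketch of sharding across two GPUs with tensor parallelism (model path is a placeholder; `tp` must not exceed the number of visible GPUs):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# tp=2 splits the weights and KV cache across 2 GPUs,
# roughly halving per-GPU memory at the cost of NCCL traffic.
backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('model_path', backend_config=backend_config)
```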
Step 5 - Reduce max prefill tokens:
- Action: Lower `max_prefill_token_num` (default: 8192 for the TurboMind backend, 4096 for the PyTorch backend).
- Trade-off: Slower prefill for long prompts; less peak activation memory.
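A sketch of capping prefill chunk size to shrink peak activation memory (the value 2048 is an illustrative choice, not a recommended default):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Smaller prefill chunks -> smaller peak activation tensors,
# at the cost of more prefill iterations for long prompts.
backend_config = TurbomindEngineConfig(max_prefill_token_num=2048)
pipe = pipeline('model_path', backend_config=backend_config)
```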
Memory budget rule: the pipeline caps the logits tensor at 2 GiB. For PPL (perplexity) computation this implies `max_input_len = 2 GiB / (vocab_size * 4 bytes)`.
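A worked instance of this formula; `vocab_size=32000` is an illustrative value (a Llama-2-style tokenizer), not read from any real model config:

```python
# 2 GiB cap on the logits tensor [bs, max_input_len, vocab_size]
LOGITS_BUDGET_BYTES = 2 * 1024**3
BYTES_PER_LOGIT = 4  # FP32 logits

vocab_size = 32000   # illustrative; use the model's actual vocab size
max_input_len = LOGITS_BUDGET_BYTES // (vocab_size * BYTES_PER_LOGIT)
print(max_input_len)  # 16777 tokens
```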
Reasoning
GPU memory allocation in LMDeploy follows this priority:
- Model weights (fixed): FP16 7B model = ~14GB, FP16 13B = ~26GB.
- KV cache (configurable): Controlled by `cache_max_entry_count` as percentage of remaining free memory.
- Activations (dynamic): Scales with `max_prefill_token_num` and batch size.
The default `cache_max_entry_count=0.8` is aggressive, allocating 80% of free memory after model loading. On a 24GB GPU with a 14GB model, this leaves only ~2GB headroom. Any spike in activation memory during prefill can trigger OOM.
Code evidence from `lmdeploy/pipeline.py:270-273`:
```python
# TODO: a better way to determine `max_input_len`, at most
# allocate 2G mem for logits with shape [bs, max_input_len, vocab_size]
vocab_size = self.async_engine.hf_cfg.vocab_size
max_input_len = 2 * 1024**3 // (vocab_size * 4)
```
From the FAQ documentation:

```python
# Quick OOM fix for pipeline
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
pipe = pipeline('model_path', backend_config=backend_config)
```

```shell
# Quick OOM fix for CLI
lmdeploy chat model --cache-max-entry-count 0.2

# Quick OOM fix for API server
lmdeploy serve api_server model --cache-max-entry-count 0.2
```