Heuristic: Roboflow RF-DETR Batch Size / Memory Trade-off
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Computer_Vision, Deep_Learning |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Memory optimization technique using gradient accumulation to maintain an effective batch size of 16 across different GPU VRAM capacities, with specific configurations for 8GB to 80GB GPUs.
Description
RF-DETR training uses an effective batch size calculated as `batch_size * grad_accum_steps * num_gpus`. Since the model processes images at resolutions of 384-784 pixels with a ViT backbone, GPU VRAM is the primary bottleneck. Gradient accumulation allows splitting the effective batch across multiple forward passes, trading training speed for reduced peak memory usage. The default configuration targets an effective batch size of 16.
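The formula above can be sketched as a one-liner (parameter names follow this document's convention, not necessarily the library's internals):

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Effective batch size as defined above: per-GPU batch * accumulation steps * GPU count."""
    return batch_size * grad_accum_steps * num_gpus

# The default target of 16 is reachable by many (batch_size, grad_accum_steps) pairs:
assert effective_batch_size(4, 4) == 16              # e.g. a single 16GB T4
assert effective_batch_size(8, 2) == 16              # e.g. a single 24GB RTX 4090
assert effective_batch_size(4, 2, num_gpus=2) == 16  # two GPUs, fewer accumulation steps
```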
Usage
Use this heuristic when configuring training for your specific GPU hardware. Apply it whenever you encounter CUDA OOM errors or want to maximize GPU utilization. Also apply when switching between single-GPU and multi-GPU setups to maintain the same effective batch size.
The Insight (Rule of Thumb)
- Action: Set `batch_size` and `grad_accum_steps` based on your GPU VRAM, targeting effective batch size of 16.
- Value:
- 8GB VRAM (RTX 3070): `batch_size=1, grad_accum_steps=16, gradient_checkpointing=True`
- 12GB VRAM: `batch_size=2, grad_accum_steps=8, gradient_checkpointing=True`
- 16GB VRAM (T4): `batch_size=4, grad_accum_steps=4`
- 24GB VRAM (RTX 3090/4090): `batch_size=8, grad_accum_steps=2`
- 40-80GB VRAM (A100): `batch_size=16, grad_accum_steps=1`
- Trade-off: Lower `batch_size` with higher `grad_accum_steps` produces mathematically equivalent updates, but each epoch is slower because the same number of images is processed in more, smaller sequential forward passes, leaving GPU parallelism underutilized. Gradient checkpointing reduces VRAM by ~30-40% at the cost of ~20% slower training.
- Constraint: the training loop splits each optimization step's batch into `grad_accum_steps` equal sub-batches, so the split must come out even (enforced by an assertion in `engine.py:87`).
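The VRAM tiers above can be encoded as a small lookup helper. This is an illustrative sketch: the thresholds come from the table above, and the returned keys (including `gradient_checkpointing`) mirror this document's parameter names rather than a confirmed library API.

```python
def config_for_vram(vram_gb: float, target_effective: int = 16) -> dict:
    """Pick a single-GPU (batch_size, grad_accum_steps) pair per the VRAM tiers above."""
    if vram_gb >= 40:        # A100-class: full effective batch in one pass
        batch_size = 16
    elif vram_gb >= 24:      # RTX 3090 / 4090
        batch_size = 8
    elif vram_gb >= 16:      # T4
        batch_size = 4
    elif vram_gb >= 12:
        batch_size = 2
    else:                    # 8GB cards, e.g. RTX 3070
        batch_size = 1
    return {
        "batch_size": batch_size,
        "grad_accum_steps": target_effective // batch_size,
        # The table recommends gradient checkpointing only below 16GB.
        "gradient_checkpointing": vram_gb < 16,
    }

print(config_for_vram(24))
# {'batch_size': 8, 'grad_accum_steps': 2, 'gradient_checkpointing': False}
```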
Reasoning
The ViT backbone (DINOv2) processes images at fixed resolution with attention that is quadratic in token count, so VRAM scales linearly with `batch_size` and superlinearly with resolution. Gradient accumulation does not change the mathematical result of the optimization step — it sums gradients across sub-batches before updating weights. The assertion `assert batch_size % args.grad_accum_steps == 0` in the training loop ensures the batch splits cleanly into equal sub-batches.
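The equivalence claim can be checked numerically on a toy quadratic loss: scaling each sub-batch's mean gradient by `1/grad_accum_steps` and summing reproduces the full-batch mean gradient (plain Python, no framework assumed).

```python
def grad(w: float, x: float, y: float) -> float:
    """Gradient of the per-sample loss (w*x - y)^2 with respect to w."""
    return 2.0 * (w * x - y) * x

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5

# Full-batch gradient: mean over all 4 samples in one pass.
full = sum(grad(w, x, y) for x, y in data) / len(data)

# Gradient accumulation: two sub-batches of 2, each sub-batch gradient scaled
# by 1/grad_accum_steps so the accumulated sum equals the full-batch mean.
grad_accum_steps = 2
acc = 0.0
for sub in (data[:2], data[2:]):
    sub_mean = sum(grad(w, x, y) for x, y in sub) / len(sub)
    acc += sub_mean / grad_accum_steps

assert abs(full - acc) < 1e-12  # identical update direction and magnitude
```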
For multi-GPU training, the effective batch size is further multiplied by the number of GPUs, so `grad_accum_steps` should be reduced accordingly:
| GPUs | batch_size | grad_accum_steps | Effective |
|---|---|---|---|
| 1 | 4 | 4 | 16 |
| 2 | 4 | 2 | 16 |
| 4 | 4 | 1 | 16 |
| 8 | 2 | 1 | 16 |
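The table rows follow from holding `batch_size * grad_accum_steps * num_gpus` at 16; a sketch of the rearrangement (illustrative helper, not a library function):

```python
def grad_accum_for(num_gpus: int, batch_size: int, target_effective: int = 16) -> int:
    """Accumulation steps needed so batch_size * steps * num_gpus hits the target."""
    per_step = batch_size * num_gpus
    if target_effective % per_step != 0:
        raise ValueError(f"{target_effective} not divisible by batch_size*num_gpus={per_step}")
    return target_effective // per_step

# Reproduces the table rows:
assert grad_accum_for(1, 4) == 4
assert grad_accum_for(2, 4) == 2
assert grad_accum_for(4, 4) == 1
assert grad_accum_for(8, 2) == 1
```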