# Heuristic: Roboflow RF-DETR Resolution Divisibility Rule
| Knowledge Sources | |
|---|---|
| Domains | Configuration, Computer_Vision, Deep_Learning |
| Last Updated | 2026-02-08 15:00 GMT |
## Overview
RF-DETR input resolution must be divisible by 14 for ONNX export and by `patch_size * num_windows` for training with multi-scale augmentation (56 for the Base variant, 32 for most others). Each model variant ships with a default resolution chosen to satisfy both constraints.
## Description
RF-DETR uses a DINOv2 Vision Transformer backbone with configurable patch sizes (12, 14, or 16 pixels). The resolution must produce an integer number of patches. Additionally, multi-scale training computes valid scales based on `resolution`, `patch_size`, and `num_windows`, requiring the resolution to be divisible by `patch_size * num_windows`. For ONNX export, the shape must be divisible by 14. The training docs state resolution "must be divisible by 56."
## Usage
Apply this rule when customizing the input resolution for training or ONNX export. A non-divisible resolution causes tensor shape mismatches in the ViT backbone during training, or an explicit `ValueError` during export.
## The Insight (Rule of Thumb)
- Action: Always use resolutions that are divisible by `patch_size * num_windows` for training, and divisible by 14 for ONNX export.
- Value: Default resolutions per model variant:
  - Nano: 384 (patch_size=16, num_windows=2, 384/32=12)
  - Small: 512 (patch_size=16, num_windows=2, 512/32=16)
  - Base: 560 (patch_size=14, num_windows=4, 560/56=10)
  - Medium: 576 (patch_size=16, num_windows=2, 576/32=18)
  - Large: 704 (patch_size=16, num_windows=2, 704/32=22)
- Safe custom resolutions: 448, 504, 560, 616, 672, 728, 784, 840, 896 (all divisible by 56)
- Trade-off: Higher resolution improves detection of small objects but requires quadratically more VRAM and compute.
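The rule above can be made concrete with a small helper. This is an illustrative sketch, not part of the RF-DETR API; the function names and the per-variant table are assumptions based on the defaults listed above:

```python
# Illustrative helpers (not part of the RF-DETR API): check a candidate
# resolution against the divisibility rules and round to the nearest
# valid value when the check fails.

VARIANTS = {
    # variant: (patch_size, num_windows) -- taken from the defaults above
    "nano": (16, 2),
    "small": (16, 2),
    "base": (14, 4),
    "medium": (16, 2),
    "large": (16, 2),
}

def nearest_valid_resolution(resolution: int, variant: str = "base") -> int:
    """Round to the nearest multiple of patch_size * num_windows."""
    patch_size, num_windows = VARIANTS[variant]
    step = patch_size * num_windows
    return max(step, round(resolution / step) * step)

def is_onnx_exportable(resolution: int) -> bool:
    """ONNX export requires both spatial dims divisible by 14."""
    return resolution % 14 == 0
```

For example, a requested Base resolution of 600 rounds to 616 (11 x 56), which is also divisible by 14 and therefore export-safe.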
## Reasoning
The ViT backbone divides the image into non-overlapping patches of `patch_size x patch_size`. Windowed attention further groups patches into windows. For multi-scale training, `compute_multi_scale_scales()` generates valid scale factors that maintain divisibility. If the resolution is not divisible, the feature map dimensions become fractional, causing tensor shape mismatches in the attention layers.
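The scale-generation logic can be sketched as follows. This is a simplified reconstruction for illustration, not the actual `compute_multi_scale_scales()` from the repository:

```python
# Simplified sketch (assumption: not the actual RF-DETR implementation).
# Generate candidate training resolutions around a base resolution that
# stay divisible by patch_size * num_windows, so every scale yields an
# integer number of patches and whole attention windows.

def multi_scale_resolutions(resolution: int, patch_size: int,
                            num_windows: int, n_scales: int = 4) -> list[int]:
    step = patch_size * num_windows
    assert resolution % step == 0, "base resolution must be divisible by step"
    base_units = resolution // step
    # symmetric scales around the base, clamped to at least one step
    return [max(1, base_units + d) * step
            for d in range(-n_scales, n_scales + 1)]
```

With the Base defaults (560, patch 14, 4 windows), every generated scale is a multiple of 56, matching the "safe custom resolutions" list above.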
ONNX export validation from `rfdetr/main.py:573-574`:

```python
if shape[0] % 14 != 0 or shape[1] % 14 != 0:
    raise ValueError("Shape must be divisible by 14")
```
The `positional_encoding_size` is computed as `resolution // patch_size` in the config classes (e.g., `rfdetr/config.py:139`):

```python
positional_encoding_size: int = 704 // 16
```
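Applying the same `resolution // patch_size` formula to the defaults listed earlier gives the positional grid size per variant. These values are derived from the defaults above, not quoted from the repository:

```python
# Derived from the default resolutions above; variant names are labels,
# not identifiers from the RF-DETR codebase.
defaults = {"nano": (384, 16), "small": (512, 16), "base": (560, 14),
            "medium": (576, 16), "large": (704, 16)}
pos_enc = {name: res // patch for name, (res, patch) in defaults.items()}
# e.g. base: 560 // 14 = 40; large: 704 // 16 = 44
```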