Principle: Roboflow RF-DETR Model Size Selection
| Knowledge Sources | |
|---|---|
| Domains | Object_Detection, Model_Architecture |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
A design principle for selecting the appropriate model size variant to balance accuracy, speed, and resource constraints in object detection tasks.
Description
Model size selection involves choosing from a family of architecturally similar models that differ in computational complexity and detection performance. In the RF-DETR family, model variants range from Nano (fastest, least accurate) to Large (most accurate, highest compute). Each variant uses the same DINOv2-based backbone with a lightweight DETR decoder but varies in input resolution, number of decoder layers, and patch size.
The key trade-off axes are:
- Input resolution: Higher resolution captures finer details but increases compute quadratically
- Decoder depth: More layers improve detection quality but add latency
- Backbone configuration: Window size and feature extraction layers affect the quality-speed balance
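The quadratic cost of resolution scaling follows from the patch-token count: a ViT-style backbone processes (H/p)·(W/p) tokens, so doubling the input side quadruples the number of tokens that attention must process. A minimal sketch (the patch size and resolutions here are illustrative, not the exact RF-DETR values):

```python
def patch_token_count(resolution: int, patch_size: int = 16) -> int:
    """Number of patch tokens a ViT-style backbone produces for a square input."""
    side = resolution // patch_size
    return side * side

# Doubling the input side quadruples the token count, and
# self-attention cost grows with the square of the token count.
tokens_384 = patch_token_count(384)  # 24 * 24 = 576
tokens_768 = patch_token_count(768)  # 48 * 48 = 2304
assert tokens_768 == 4 * tokens_384
```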
Usage
Apply this principle when deploying an object detection system that must trade detection accuracy against inference speed. Consider Nano or Small for edge deployment and real-time applications, Base or Medium for balanced performance, and Large when accuracy is paramount and compute resources are available.
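The guidance above can be sketched as a small selection heuristic. The variant names follow the RF-DETR family, but the helper function and its decision thresholds are illustrative assumptions, not part of any library API:

```python
def suggest_variant(priority: str, edge_device: bool = False) -> str:
    """Map a deployment priority to an RF-DETR size variant (illustrative heuristic).

    priority: one of "speed", "balanced", or "accuracy".
    edge_device: True when targeting constrained hardware.
    """
    if edge_device:
        return "Nano"          # fastest, smallest footprint
    if priority == "speed":
        return "Small"         # real-time on capable hardware
    if priority == "balanced":
        return "Base"          # or "Medium" for a bit more accuracy
    if priority == "accuracy":
        return "Large"         # highest accuracy, highest compute
    raise ValueError(f"unknown priority: {priority!r}")
```

For example, `suggest_variant("balanced")` returns `"Base"`, while an edge deployment always maps to `"Nano"` regardless of the stated priority.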
Theoretical Basis
Model scaling follows the observation that detection performance improves with increased model capacity, but with diminishing returns. The RF-DETR family scales along three dimensions:
- Resolution scaling: Input sizes range from 384px (Nano) to 704px (Large)
- Depth scaling: Decoder layers range from 2 (Nano) to 4 (Large)
- Width scaling: Hidden dimensions remain constant at 256, but attention heads vary
Each variant uses a Pydantic configuration class that encapsulates all architecture parameters, ensuring consistent and validated model construction.
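The configuration idea can be sketched as follows. The actual RF-DETR code uses Pydantic models; for self-containment this sketch uses a stdlib dataclass with equivalent validation, and the field names are illustrative. The Nano and Large numbers come from the scaling ranges above; intermediate variants are omitted:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetrVariantConfig:
    """Illustrative architecture config for one size variant."""
    name: str
    resolution: int        # square input size in pixels
    decoder_layers: int
    hidden_dim: int = 256  # constant across variants
    patch_size: int = 16   # illustrative value

    def __post_init__(self):
        # Reject inputs that do not divide evenly into patches,
        # mimicking the validation a Pydantic model would perform.
        if self.resolution % self.patch_size != 0:
            raise ValueError("resolution must be a multiple of patch_size")

# Endpoints of the scaling range described in the text.
NANO = DetrVariantConfig("Nano", resolution=384, decoder_layers=2)
LARGE = DetrVariantConfig("Large", resolution=704, decoder_layers=4)
```

Freezing the dataclass mirrors the validated-then-immutable pattern: once a variant's configuration is constructed, model-building code can rely on it without re-checking invariants.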