Workflow:Roboflow Rf detr Object Detection Inference

Knowledge Sources	RF-DETR, RF-DETR Docs, RF-DETR Paper
Domains	Computer_Vision, Object_Detection, Inference
Last Updated	2026-02-08 15:00 GMT

Overview

End-to-end process for running object detection inference on images or video using pretrained RF-DETR models with the DINOv2 vision transformer backbone.

Description

This workflow covers the standard procedure for performing real-time object detection using RF-DETR. It loads a pretrained model (available in sizes from Nano to 2XLarge), preprocesses input images through normalization and resizing, runs a forward pass through the DINOv2 backbone and transformer decoder, applies post-processing to extract bounding boxes with confidence scores, and visualizes results using the supervision library. The workflow supports single images, batched images, video files, webcam streams, and RTSP streams.

Usage

Execute this workflow when you need to detect objects in images or video using a pretrained RF-DETR model. This is the primary use case for users who want out-of-the-box detection on COCO-class objects (80 categories) or who have a fine-tuned checkpoint and want to run inference on new data. The workflow supports multiple model sizes to balance accuracy and latency requirements.

Execution Steps

Step 1: Select Model Size

Choose the RF-DETR model variant that fits your accuracy and latency requirements. Models range from Nano (2.3 ms, 48.4 AP) to 2XLarge (17.2 ms, 60.1 AP). Each size has its own class (for example RFDETRNano, RFDETRSmall, RFDETRMedium, RFDETRBase, RFDETRLarge) that automatically configures the correct backbone encoder, decoder layers, resolution, and patch size.

Key considerations:

Nano through Large are Apache 2.0 licensed; XLarge and 2XLarge require a Roboflow account
Higher resolution models achieve better accuracy but require more VRAM and have higher latency
For custom fine-tuned models, pass the checkpoint path via the pretrain_weights parameter
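
As a rough illustration of the selection step, the sketch below maps a size name to one of the class names listed above and passes an optional checkpoint through pretrain_weights. The helper names (resolve_model_class_name, load_model) are hypothetical, and the code assumes the classes are importable from a top-level rfdetr package. Only the five class names given in this step are covered.

```python
# Size names mapped to the class names listed in this step.
MODEL_CLASS_BY_SIZE = {
    "nano": "RFDETRNano",
    "small": "RFDETRSmall",
    "medium": "RFDETRMedium",
    "base": "RFDETRBase",
    "large": "RFDETRLarge",
}


def resolve_model_class_name(size):
    """Return the class name for a size string (hypothetical helper)."""
    key = size.strip().lower()
    if key not in MODEL_CLASS_BY_SIZE:
        known = ", ".join(sorted(MODEL_CLASS_BY_SIZE))
        raise ValueError(f"unknown size {size!r}; expected one of: {known}")
    return MODEL_CLASS_BY_SIZE[key]


def load_model(size="base", checkpoint_path=None):
    """Instantiate the chosen class, optionally from a fine-tuned checkpoint."""
    # Assumption: the classes live in the top-level `rfdetr` package.
    import rfdetr  # imported lazily so the mapping works without the package

    model_cls = getattr(rfdetr, resolve_model_class_name(size))
    kwargs = {}
    if checkpoint_path is not None:
        # `pretrain_weights` is the parameter named in this step.
        kwargs["pretrain_weights"] = checkpoint_path
    return model_cls(**kwargs)
```

Keeping the name lookup separate from the import means the choice can be validated (for example from a config file) before any weights are downloaded.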

Step 2: Initialize Model

Instantiate the chosen model class. During initialization, the model configuration is created (encoder type, hidden dimensions, decoder layers, resolution), pretrained weights are downloaded from Google Cloud Storage if not already cached locally, the LWDETR architecture is constructed (DINOv2 backbone, multi-scale projector, transformer decoder), and weights are loaded into the model.

Key considerations:

Weights are cached locally after first download
The model automatically detects the available device (CUDA, MPS, or CPU)
For fine-tuned models, the detection head is reinitialized if the number of classes differs from the checkpoint
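
The device detection mentioned above can be pictured as a simple fallback chain. This is a sketch, not the library's own code: the priority order (CUDA, then MPS, then CPU) is an assumption taken from the order in which the devices are listed, and pick_device and detect_device are hypothetical names.

```python
def pick_device(cuda_available, mps_available):
    """Choose a device string; assumed priority is CUDA > MPS > CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch for available accelerators and pick one."""
    import torch  # imported lazily; only needed for the actual query

    # Older PyTorch builds may lack the MPS backend module.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)
```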

Step 3: Prepare Input

Load the input image from a file path, PIL Image, NumPy array, or torch Tensor. The image must be in RGB channel order. A torch Tensor must already be scaled to the [0, 1] range and have shape (C, H, W).

Key considerations:

Multiple image formats are accepted: file paths, PIL Images, NumPy arrays, torch Tensors
Batch inference is supported by passing a list of images
The predict method handles all preprocessing internally
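
Because tensor inputs carry the strictest requirements, a small pre-flight check can catch mistakes before prediction. The sketch below is a hypothetical helper, not part of RF-DETR. It takes the shape and value range as plain numbers so it runs without torch, and it assumes three channels because the input is RGB.

```python
def check_tensor_input(shape, min_value, max_value):
    """Return a list of problems with a tensor input (empty if it looks fine)."""
    problems = []
    if len(shape) != 3:
        problems.append(f"expected 3 dimensions (C, H, W), got {len(shape)}")
    elif shape[0] != 3:
        # Assumption: RGB input means exactly three leading channels.
        problems.append(f"expected 3 channels first, got shape {tuple(shape)}")
    if min_value < 0.0 or max_value > 1.0:
        problems.append(
            f"values must lie in [0, 1], got [{min_value}, {max_value}]"
        )
    return problems


# With a torch Tensor `t`, the call might look like:
#   check_tensor_input(tuple(t.shape), float(t.min()), float(t.max()))
```

A channels-last uint8 array, such as a raw decoded frame, fails both checks: its shape is (H, W, 3) and its values run up to 255.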

Step 4: Run Prediction

Call the predict method with the prepared image and a confidence threshold. Internally, the image is converted to a tensor, normalized with ImageNet mean and standard deviation values, resized to the model resolution, and passed through the neural network. The model outputs class logits and bounding box coordinates, which are post-processed to produce detections in the original image coordinate space.

Key considerations:

The default confidence threshold is 0.5; adjust based on precision/recall needs
To reduce inference latency, call optimize_for_inference() before predicting (uses JIT tracing)
The output is a supervision Detections object containing xyxy bounding boxes, confidence scores, and class IDs
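
The internals described in this step can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation. It assumes boxes are emitted as normalized (cx, cy, w, h), as is common in DETR-family models, and that scores come from a sigmoid over the logits. It takes each query's best class, whereas the real post-processing may select detections differently. The mean and standard deviation are the standard ImageNet values. run_prediction is a hypothetical wrapper around the two documented calls, predict and optimize_for_inference.

```python
import math

# Standard ImageNet statistics for RGB channels scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Normalize one RGB pixel (values in [0, 1]) channel by channel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cxcywh_to_xyxy(box, width, height):
    """Assumed input: normalized (cx, cy, w, h). Returns pixel-space xyxy."""
    cx, cy, w, h = box
    return ((cx - w / 2) * width, (cy - h / 2) * height,
            (cx + w / 2) * width, (cy + h / 2) * height)


def postprocess(logits, boxes, width, height, threshold=0.5):
    """Keep each query's best class if its score exceeds the threshold."""
    kept = []
    for query_logits, box in zip(logits, boxes):
        scores = [sigmoid(z) for z in query_logits]
        class_id = max(range(len(scores)), key=scores.__getitem__)
        if scores[class_id] > threshold:
            kept.append((cxcywh_to_xyxy(box, width, height), scores[class_id], class_id))
    return kept


def run_prediction(model, image, threshold=0.5, optimize=False):
    """Call the documented entry points on an already constructed model."""
    if optimize:
        model.optimize_for_inference()
    return model.predict(image, threshold=threshold)
```

Raising the threshold trades recall for precision: fewer queries pass the score test, so fewer but more confident boxes are returned.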

Step 5: Visualize Results

Map class IDs to human-readable labels using the COCO class mapping or the model's custom class names. Annotate the image with bounding boxes and labels using the supervision library's BoxAnnotator and LabelAnnotator.

Key considerations:

The supervision library provides multiple annotation styles (boxes, labels, masks, halos)
For video or streaming use cases, process frames in a loop with OpenCV for capture and display
Class names for fine-tuned models are stored in the model checkpoint and loaded automatically
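
A minimal sketch of the labeling and annotation step follows. build_labels and annotate are hypothetical helper names, and class_names is assumed to be a dict from class ID to name supplied by the caller (the COCO mapping or a fine-tuned model's names). The annotator calls use supervision's BoxAnnotator and LabelAnnotator as named above.

```python
def build_labels(class_ids, confidences, class_names):
    """Format one 'name score' label per detection.

    Falls back to the numeric ID when a class has no name in the mapping.
    """
    labels = []
    for class_id, confidence in zip(class_ids, confidences):
        name = class_names.get(int(class_id), str(int(class_id)))
        labels.append(f"{name} {float(confidence):.2f}")
    return labels


def annotate(image, detections, class_names):
    """Draw boxes and labels on a copy of the image using supervision."""
    import supervision as sv  # imported lazily; only needed for drawing

    labels = build_labels(detections.class_id, detections.confidence, class_names)
    annotated = sv.BoxAnnotator().annotate(scene=image.copy(), detections=detections)
    return sv.LabelAnnotator().annotate(
        scene=annotated, detections=detections, labels=labels
    )
```

For video or streams, the same annotate call can be applied to each captured frame inside the capture loop.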

Execution Diagram

GitHub URL

Workflow Repository