Principle:Roboflow Rf detr Image Preprocessing
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Preprocessing |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
The process of transforming raw images into normalized, resized tensors suitable for input to a neural network.
Description
Image preprocessing for object detection models involves three essential transforms applied in sequence:
- To Tensor: Convert PIL Images, NumPy arrays, or file paths to PyTorch float tensors scaled to [0, 1]
- Normalize: Apply ImageNet channel-wise normalization with mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]
- Resize: Scale images to the model's expected square resolution (e.g. 560x560 for Base)
These transforms ensure consistent input regardless of source image format, size, or value range. Original image dimensions are preserved for post-processing (mapping detections back to original coordinates).
Usage
Use this principle whenever feeding images to a pretrained vision model. The specific normalization statistics must match those used during model pretraining (ImageNet statistics for DINOv2-based models).
Theoretical Basis
Channel-wise normalization ensures each color channel has approximately zero mean and unit variance, matching the distribution the model was trained on. The formula for each pixel is:

$$x'_c = \frac{x_c - \mu_c}{\sigma_c}$$

Where x_c is the pixel value in channel c (already scaled to [0, 1]) and μ_c and σ_c are the per-channel ImageNet mean and standard deviation. This standardization prevents any single channel from dominating the learned features and supports stable gradient flow.
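As a worked example of the normalization arithmetic, the sketch below applies the per-channel formula to single RGB pixels in plain Python. The helper name `normalize_pixel` is hypothetical and exists only for illustration. Input values are assumed to be already scaled to [0, 1].

```python
# ImageNet per-channel statistics (R, G, B).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Apply (x - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return tuple(
        (x - m) / s for x, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


black = normalize_pixel((0.0, 0.0, 0.0))       # darkest possible input
white = normalize_pixel((1.0, 1.0, 1.0))       # brightest possible input
mean_pixel = normalize_pixel(IMAGENET_MEAN)    # a pixel equal to the mean
```

A pixel equal to the ImageNet mean maps to zero in every channel, while pure black and pure white bound the normalized range at roughly -2.12 and 2.64, so normalized inputs are not confined to [0, 1] or [-1, 1].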