Heuristic: Roboflow RF-DETR Layer-Wise LR Decay
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning, Computer_Vision |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
A differential learning-rate strategy that applies exponential decay across the ViT encoder layers and a separate decay multiplier to the decoder, preserving pretrained backbone features while letting task-specific heads learn faster.
Description
RF-DETR applies three different learning rate groups during fine-tuning: (1) the backbone ViT encoder with per-layer exponential LR decay, (2) the transformer decoder with a component decay multiplier, and (3) all other parameters (detection head, projector) at the base LR. Additionally, certain backbone parameters (gamma, pos_embed, rel_pos, bias, norm) have their weight decay set to zero to preserve learned representations.
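As an illustrative sketch of the three-group split (the parameter-name prefixes and the `lr_for` helper are assumptions for demonstration, not RF-DETR's exact code), routing a parameter name to its learning rate could look like:

```python
# Illustrative sketch of the three LR groups; name prefixes and defaults
# here are assumptions for demonstration, not RF-DETR's exact code.
def lr_for(name, base_lr=1e-4, lr_encoder=1.5e-4,
           lr_vit_layer_decay=0.8, lr_component_decay=0.7,
           num_layers=12, layer_id=12):
    if name.startswith("backbone"):
        # Group 1: encoder LR, scaled per layer (in the real implementation
        # layer_id is parsed from the parameter name).
        return lr_encoder * lr_vit_layer_decay ** (num_layers + 1 - layer_id)
    if name.startswith("transformer.decoder"):
        # Group 2: decoder LR = base LR times the component decay multiplier.
        return base_lr * lr_component_decay
    # Group 3: detection head, projector, and everything else at the base LR.
    return base_lr
```

For example, `lr_for("transformer.decoder.layers.0.weight")` yields the decoder rate `1e-4 * 0.7`, while any head parameter falls through to the base LR.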
Usage
This heuristic is applied automatically during training via `get_param_dict()`. Adjust `lr_vit_layer_decay` (default 0.8) and `lr_component_decay` (default 0.7) when you need more or less fine-tuning of the backbone. Lower decay values shrink the learning rates of earlier backbone layers more aggressively, approaching a freeze.
The Insight (Rule of Thumb)
- Action: Configure three LR groups with different rates.
- Value:
- Backbone encoder: `lr_encoder=1.5e-4` with layer-wise exponential decay factor `lr_vit_layer_decay=0.8`. Layer `layer_id` gets LR = `lr_encoder * lr_vit_layer_decay^(num_layers + 1 - layer_id)`, so earlier layers get exponentially smaller LRs.
- Decoder: `lr * lr_component_decay = 1e-4 * 0.7 = 7e-5`
- Other (head, projector): `lr = 1e-4`
- Trade-off: Lower `lr_vit_layer_decay` preserves pretrained features better (good for small datasets) but limits adaptation. Higher values allow more backbone adaptation (good for domain-shifted data).
- Weight decay exception: Bias, normalization, positional embedding, and gamma parameters get `weight_decay=0` regardless of the global setting.
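With the defaults above, the per-layer encoder LRs work out as follows (a quick arithmetic check, assuming a 12-block ViT where `layer_id` 0 is the patch/positional embedding and 13 is the topmost group):

```python
# Worked example: encoder LR per layer for num_layers=12, lr_encoder=1.5e-4,
# lr_vit_layer_decay=0.8.
lr_encoder, decay, num_layers = 1.5e-4, 0.8, 12
lrs = {lid: lr_encoder * decay ** (num_layers + 1 - lid)
       for lid in range(num_layers + 2)}
for lid in (0, 1, 7, 13):
    print(f"layer {lid:2d}: lr = {lrs[lid]:.3e}")
```

The topmost group trains at the full `1.5e-4`, while the embeddings receive `1.5e-4 * 0.8^13`, roughly 18x smaller.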
Reasoning
The DINOv2 backbone is pretrained on a massive image dataset. Earlier layers learn general features (edges, textures) that transfer well; later layers learn more task-specific features. Layer-wise LR decay preserves the general features in early layers while allowing later layers to adapt more freely.
Layer ID assignment from `rfdetr/util/get_param_dicts.py:30-37`:

```python
layer_id = num_layers + 1
if name.startswith("backbone"):
    if ".pos_embed" in name or ".patch_embed" in name:
        layer_id = 0
    elif ".blocks." in name and ".residual." not in name:
        layer_id = int(name[name.find(".blocks."):].split(".")[2]) + 1
return lr_decay_rate ** (num_layers + 1 - layer_id)
```
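Wrapped into a self-contained function (the name and signature are assumed from the snippet, in the style of common ViT layer-decay helpers), this logic can be exercised directly:

```python
# Self-contained version of the snippet above; the function name and
# signature are assumptions for demonstration.
def vit_lr_decay_rate(name, lr_decay_rate=0.8, num_layers=12):
    layer_id = num_layers + 1  # default: topmost group, no decay
    if name.startswith("backbone"):
        if ".pos_embed" in name or ".patch_embed" in name:
            layer_id = 0  # embeddings decay the most
        elif ".blocks." in name and ".residual." not in name:
            # ".blocks.<k>." in the name maps to layer_id = k + 1
            layer_id = int(name[name.find(".blocks."):].split(".")[2]) + 1
    return lr_decay_rate ** (num_layers + 1 - layer_id)
```

For example, `vit_lr_decay_rate("backbone.encoder.blocks.3.attn.qkv.weight")` returns `0.8 ** 9`, since block 3 maps to `layer_id = 4`.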
Weight decay zeroing for special parameters from `rfdetr/util/get_param_dicts.py:50-52`:

```python
if ('gamma' in name) or ('pos_embed' in name) or ('rel_pos' in name) \
        or ('bias' in name) or ('norm' in name):
    weight_decay_rate = 0.
```
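The same condition, written as a small standalone helper (the helper name is hypothetical):

```python
# Hypothetical helper mirroring the condition above: any of these substrings
# in a parameter name forces its weight decay to zero.
NO_DECAY_KEYS = ("gamma", "pos_embed", "rel_pos", "bias", "norm")

def weight_decay_for(name, base_weight_decay=1e-4):
    if any(key in name for key in NO_DECAY_KEYS):
        return 0.0
    return base_weight_decay
```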
Decoder component decay from `rfdetr/util/get_param_dicts.py:68-70`:

```python
decoder_param_lr_pairs = [
    {"params": param, "lr": args.lr * args.lr_component_decay}
    for param in decoder_params
]
```
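Assembling the decoder group in isolation (with stand-in values, since `args` and `decoder_params` come from the surrounding RF-DETR training code):

```python
# Stand-ins for values supplied by the surrounding RF-DETR code.
class Args:
    lr = 1e-4
    lr_component_decay = 0.7

args = Args()
decoder_params = ["decoder.layers.0.self_attn.weight",
                  "decoder.layers.1.self_attn.weight"]  # placeholder names

# Mirrors the list comprehension above: one group per decoder parameter,
# each at the base LR scaled by the component decay multiplier.
decoder_param_lr_pairs = [
    {"params": param, "lr": args.lr * args.lr_component_decay}
    for param in decoder_params
]
```

Every decoder parameter therefore trains at `1e-4 * 0.7 = 7e-5`, the rate quoted in the rule of thumb above.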