Heuristic: Kronos Learning Rate and Optimizer Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training |
| Last Updated | 2026-02-09 13:47 GMT |
Overview
Kronos finetuning uses AdamW with an aggressive `beta2=0.95`, strong `weight_decay=0.1`, and a OneCycleLR schedule with only 3% warmup.
Description
Kronos uses AdamW optimizer with non-standard hyperparameters tuned for noisy financial time series. The key departures from typical defaults are: `beta2=0.95` (vs. standard 0.999) for more aggressive second-moment adaptation, `weight_decay=0.1` for strong regularization, and OneCycleLR scheduler with only 3% warmup (`pct_start=0.03`) and `div_factor=10`. The tokenizer and predictor use different learning rates (2e-4 vs 4e-5) reflecting their different adaptation needs.
Usage
Use this heuristic when:
- Configuring finetuning hyperparameters: Start with these defaults before experimenting
- Debugging training instability: These values were chosen for financial data with regime changes and noise
- Adapting to new domains: The aggressive beta2 and strong weight decay may need adjustment for non-financial data
The Insight (Rule of Thumb)
- Action: Use AdamW with `beta1=0.9`, `beta2=0.95`, `weight_decay=0.1`. Use OneCycleLR with `pct_start=0.03`, `div_factor=10`.
- Tokenizer LR: `2e-4` (higher for codebook adaptation)
- Predictor LR: `4e-5` (lower for stable sequence modeling)
- Trade-off:
  - `beta2=0.95` adapts faster to changes in gradient variance but is noisier than 0.999
  - `weight_decay=0.1` provides strong regularization but may underfit on small datasets
  - `pct_start=0.03` (3% warmup) is very aggressive; 10-30% is more typical
Reasoning
beta2=0.95 (vs. 0.999): Financial time series exhibit frequent regime changes (bull/bear markets, volatility spikes). A lower beta2 gives the optimizer a shorter memory for gradient variance, allowing faster adaptation to distributional shifts. Standard beta2=0.999 would be too sluggish for this domain.
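The "shorter memory" claim can be made concrete with the standard rule of thumb that an exponential moving average with decay `beta2` has an effective window of roughly `1 / (1 - beta2)` steps:

```python
# Effective averaging window of the second-moment EMA: ~1 / (1 - beta2).
for beta2 in (0.95, 0.999):
    window = 1.0 / (1.0 - beta2)
    print(f"beta2={beta2}: ~{window:.0f}-step memory for gradient variance")
# beta2=0.95 remembers ~20 steps of gradients; beta2=0.999 remembers ~1000,
# which is why 0.95 reacts far faster to a volatility regime change.
```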
weight_decay=0.1: Stronger than typical 0.01. Financial data is noisy with many spurious correlations. Heavy regularization prevents overfitting to short-term patterns that don't generalize.
pct_start=0.03: The 3% warmup phase is very short, meaning the optimizer quickly reaches peak learning rate. This works because the models are pretrained and already near a good initialization; they don't need extended warmup to stabilize.
div_factor=10: Initial learning rate is max_lr/10, providing a brief ramp-up period before full-speed training.
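Concretely, with `max_lr=4e-5` and `div_factor=10` the schedule starts at 4e-6, ramps to 4e-5 over the first 3% of steps, then anneals down. A minimal sketch tracing this (the step count and the `Linear` stand-in model are illustrative; the real run sizes the cycle from `len(train_loader) * epochs`):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-5, betas=(0.9, 0.95), weight_decay=0.1
)
total_steps = 1000
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=4e-5, total_steps=total_steps,
    pct_start=0.03, div_factor=10,
)

lrs = []
for _ in range(total_steps):
    optimizer.step()   # optimizer first, then scheduler, to match PyTorch's ordering
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

# Warmup lasts pct_start * total_steps = 30 steps: lr climbs 4e-6 -> 4e-5,
# then cosine-anneals downward for the remaining 97% of training.
```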
Evidence from `finetune/config.py:64-67`:
```python
# AdamW optimizer parameters.
self.adam_beta1 = 0.9
self.adam_beta2 = 0.95
self.adam_weight_decay = 0.1
```
OneCycleLR scheduler from `finetune/train_predictor.py:77-81`:
```python
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=config['predictor_learning_rate'],
    steps_per_epoch=len(train_loader), epochs=config['epochs'],
    pct_start=0.03, div_factor=10
)
```
Different learning rates from `finetune/config.py:57-59`:
```python
self.tokenizer_learning_rate = 2e-4
self.predictor_learning_rate = 4e-5
```
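Since the tokenizer and predictor adapt at different rates, each gets its own optimizer at its own learning rate. A hedged sketch: the `Linear` modules below are stand-ins for the actual Kronos tokenizer and predictor.

```python
import torch

# Stand-in modules; the real Kronos tokenizer/predictor are far larger.
tokenizer = torch.nn.Linear(8, 8)
predictor = torch.nn.Linear(8, 8)

tokenizer_opt = torch.optim.AdamW(
    tokenizer.parameters(), lr=2e-4,   # higher LR: codebook must move substantially
    betas=(0.9, 0.95), weight_decay=0.1,
)
predictor_opt = torch.optim.AdamW(
    predictor.parameters(), lr=4e-5,   # lower LR: preserve pretrained sequence modeling
    betas=(0.9, 0.95), weight_decay=0.1,
)
```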