Heuristic: Kronos Learning Rate and Optimizer Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training |
| Last Updated | 2026-02-09 13:47 GMT |
Overview
Kronos finetuning uses AdamW with an aggressive `beta2=0.95`, strong `weight_decay=0.1`, and a OneCycleLR schedule with only 3% warmup.
Description
Kronos uses AdamW optimizer with non-standard hyperparameters tuned for noisy financial time series. The key departures from typical defaults are: `beta2=0.95` (vs. standard 0.999) for more aggressive second-moment adaptation, `weight_decay=0.1` for strong regularization, and OneCycleLR scheduler with only 3% warmup (`pct_start=0.03`) and `div_factor=10`. The tokenizer and predictor use different learning rates (2e-4 vs 4e-5) reflecting their different adaptation needs.
Usage
Use this heuristic when:
- Configuring finetuning hyperparameters: Start with these defaults before experimenting
- Debugging training instability: These values were chosen for financial data with regime changes and noise
- Adapting to new domains: The aggressive beta2 and strong weight decay may need adjustment for non-financial data
The Insight (Rule of Thumb)
- Action: Use AdamW with `beta1=0.9`, `beta2=0.95`, `weight_decay=0.1`. Use OneCycleLR with `pct_start=0.03`, `div_factor=10`.
- Tokenizer LR: `2e-4` (higher for codebook adaptation)
- Predictor LR: `4e-5` (lower for stable sequence modeling)
- Trade-off:
  - `beta2=0.95` adapts faster to changes in gradient variance but is noisier than 0.999
  - `weight_decay=0.1` provides strong regularization but may underfit on small datasets
  - `pct_start=0.03` (3% warmup) is very aggressive; 10-30% is more typical
Reasoning
beta2=0.95 (vs. 0.999): Financial time series exhibit frequent regime changes (bull/bear markets, volatility spikes). A lower beta2 gives the optimizer a shorter memory for gradient variance, allowing faster adaptation to distributional shifts. Standard beta2=0.999 would be too sluggish for this domain.
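The "shorter memory" claim can be made concrete with the standard rule of thumb that an exponential moving average with decay `beta2` has an effective window of roughly `1 / (1 - beta2)` steps:

```python
# Effective averaging window of the second-moment EMA: ~1 / (1 - beta2).
for beta2 in (0.95, 0.999):
    window = 1.0 / (1.0 - beta2)
    print(f"beta2={beta2}: ~{window:.0f}-step memory for gradient variance")
# beta2=0.95 remembers ~20 steps of gradients; beta2=0.999 remembers ~1000,
# which is why 0.95 reacts far faster to a volatility regime change.
```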
weight_decay=0.1: Stronger than typical 0.01. Financial data is noisy with many spurious correlations. Heavy regularization prevents overfitting to short-term patterns that don't generalize.
pct_start=0.03: The 3% warmup phase is very short, meaning the optimizer quickly reaches peak learning rate. This works because the models are pretrained and already near a good initialization; they don't need extended warmup to stabilize.
div_factor=10: Initial learning rate is max_lr/10, providing a brief ramp-up period before full-speed training.
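Concretely, with `max_lr=4e-5` and `div_factor=10` the schedule starts at 4e-6, ramps to 4e-5 over the first 3% of steps, then anneals down. A minimal sketch tracing this (the step count and the `Linear` stand-in model are illustrative; the real run sizes the cycle from `len(train_loader) * epochs`):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-5, betas=(0.9, 0.95), weight_decay=0.1
)
total_steps = 1000
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=4e-5, total_steps=total_steps,
    pct_start=0.03, div_factor=10,
)

lrs = []
for _ in range(total_steps):
    optimizer.step()   # optimizer first, then scheduler, to match PyTorch's ordering
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

# Warmup lasts pct_start * total_steps = 30 steps: lr climbs 4e-6 -> 4e-5,
# then cosine-anneals downward for the remaining 97% of training.
```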
Evidence from `finetune/config.py:64-67`:
```python
# AdamW optimizer parameters.
self.adam_beta1 = 0.9
self.adam_beta2 = 0.95
self.adam_weight_decay = 0.1
```
OneCycleLR scheduler from `finetune/train_predictor.py:77-81`:
```python
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=config['predictor_learning_rate'],
    steps_per_epoch=len(train_loader), epochs=config['epochs'],
    pct_start=0.03, div_factor=10
)
```
Different learning rates from `finetune/config.py:57-59`:
```python
self.tokenizer_learning_rate = 2e-4
self.predictor_learning_rate = 4e-5
```
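Since the tokenizer and predictor adapt at different rates, each gets its own optimizer at its own learning rate. A hedged sketch: the `Linear` modules below are stand-ins for the actual Kronos tokenizer and predictor.

```python
import torch

# Stand-in modules; the real Kronos tokenizer/predictor are far larger.
tokenizer = torch.nn.Linear(8, 8)
predictor = torch.nn.Linear(8, 8)

tokenizer_opt = torch.optim.AdamW(
    tokenizer.parameters(), lr=2e-4,   # higher LR: codebook must move substantially
    betas=(0.9, 0.95), weight_decay=0.1,
)
predictor_opt = torch.optim.AdamW(
    predictor.parameters(), lr=4e-5,   # lower LR: preserve pretrained sequence modeling
    betas=(0.9, 0.95), weight_decay=0.1,
)
```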