# Heuristic: Shiyu-coder Kronos Two-Stage Finetuning Strategy
| Knowledge Sources | |
|---|---|
| Domains | Training, Optimization |
| Last Updated | 2026-02-09 13:47 GMT |
## Overview
Sequential two-stage finetuning strategy: train the VQ-VAE tokenizer first, then finetune the autoregressive predictor using the frozen finetuned tokenizer.
## Description
Kronos employs a strict two-phase finetuning pipeline. In Phase 1, the VQ-VAE tokenizer is finetuned on the target domain data to adapt its codebook to the new data distribution. In Phase 2, the autoregressive predictor is finetuned using the frozen (non-trainable) finetuned tokenizer for on-the-fly token encoding. This ordering is critical: the predictor must learn to predict tokens from the adapted tokenizer, not the pretrained one. The tokenizer is always set to `eval()` mode during predictor training.
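The ordering described above can be sketched as a minimal two-phase driver. This is a toy stand-in, not the actual trainer: the class and function names are hypothetical, and a simple `training` flag mimics the PyTorch `train()`/`eval()` mode switch mentioned in the description.

```python
class Tokenizer:
    """Hypothetical stand-in for the Kronos VQ-VAE tokenizer."""
    def __init__(self):
        self.training = True

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def encode(self, batch):
        # Phase 2 requires a frozen tokenizer for stable token IDs.
        assert not self.training, "tokenizer must be frozen during Phase 2"
        return [hash(x) % 16 for x in batch]  # fake discrete token IDs


def finetune(tokenizer, data):
    # Phase 1: adapt the tokenizer codebook to the target distribution.
    tokenizer.train()
    # ... unsupervised reconstruction training loop would go here ...

    # Phase 2: freeze the tokenizer, then finetune the predictor on its
    # on-the-fly token encodings (no gradients flow into the tokenizer).
    tokenizer.eval()
    tokens = tokenizer.encode(data)
    # ... autoregressive predictor training loop over `tokens` ...
    return tokens
```

The key invariant is that `encode` is only ever called in `eval()` mode during Phase 2, so the predictor always sees a stable tokenization.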
## Usage
Use this heuristic when:
- Finetuning Kronos on new data: Always train tokenizer first, then predictor
- Debugging poor finetuning results: Verify tokenizer quality before blaming the predictor
- Deciding whether to skip tokenizer finetuning: If your data is similar to the pretrained distribution, you may skip Phase 1 (configurable via `train_tokenizer` flag)
## The Insight (Rule of Thumb)
- Action: Always finetune the tokenizer (Phase 1) before the predictor (Phase 2). Never finetune them jointly.
- Value: Phase 1 uses `tokenizer_learning_rate=2e-4`; Phase 2 uses `predictor_learning_rate=4e-5` (5x lower).
- Trade-off: Sequential training takes longer than joint training but prevents the predictor from compensating for a poorly adapted tokenizer. The tokenizer learns the data distribution first (unsupervised reconstruction), then the predictor builds task-specific patterns on top.
- Skip option: Both phases can be independently enabled/disabled via `train_tokenizer` and `train_basemodel` config flags.
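The flags and learning rates above can be collected into a small config object. This is a simplified sketch, not the actual `finetune/config.py` class; only the four field names and default values are taken from the source, the dataclass structure is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    # Phase toggles: each stage can be enabled/disabled independently.
    train_tokenizer: bool = True
    train_basemodel: bool = True
    # The tokenizer adapts aggressively; the predictor updates
    # conservatively to preserve pretrained sequence modeling.
    tokenizer_learning_rate: float = 2e-4
    predictor_learning_rate: float = 4e-5


cfg = FinetuneConfig()
# The tokenizer LR is 5x the predictor LR.
lr_ratio = cfg.tokenizer_learning_rate / cfg.predictor_learning_rate
```

Setting `train_tokenizer=False` skips Phase 1 (useful when the target data is close to the pretrained distribution), while `train_basemodel=False` skips Phase 2.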
## Reasoning
The two-stage strategy mirrors the model's architecture: the tokenizer converts continuous OHLCV data into discrete tokens, and the predictor models token sequences autoregressively. If both were trained jointly, gradient signals from the predictor would interfere with the tokenizer's codebook learning, leading to unstable training. By freezing the tokenizer during Phase 2, the predictor receives a stable tokenization and can focus on learning temporal dependencies.
The 5x higher learning rate for the tokenizer (2e-4 vs 4e-5) reflects that the tokenizer needs more aggressive adaptation to learn the codebook distribution, while the predictor benefits from conservative updates that preserve pretrained sequence modeling capabilities.
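One way to see why freezing works: in Phase 2 only the predictor's parameters are handed to the optimizer, so tokenizer weights cannot drift no matter what gradients the predictor produces. A toy illustration with plain dicts and a hand-rolled SGD step (parameter names and values are made up, not from the repo):

```python
def sgd_step(params, grads, lr):
    """Minimal SGD update on a dict of scalar parameters."""
    return {k: params[k] - lr * grads[k] for k in params}


tokenizer_params = {"codebook_0": 0.5}
predictor_params = {"attn_w": 0.1}

# Phase 2: gradients are computed and applied only for the predictor;
# tokenizer_params is never passed to the update, i.e. it is frozen.
predictor_grads = {"attn_w": 0.2}
predictor_params = sgd_step(predictor_params, predictor_grads, lr=4e-5)

assert tokenizer_params == {"codebook_0": 0.5}  # unchanged
```

In the real trainer this separation is enforced by wrapping tokenizer calls in `torch.no_grad()` and keeping the tokenizer in `eval()` mode, as the evidence below shows.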
Evidence from `finetune_csv/train_sequential.py:264-306`:
```python
if self.config.train_tokenizer:
    success = self.train_tokenizer_phase()  # Phase 1
else:
    print("Skipping Tokenizer training phase")
if self.config.train_basemodel:
    success = self.train_basemodel_phase()  # Phase 2
else:
    print("Skipping Basemodel training phase")
```
Frozen tokenizer usage during predictor training from `finetune/train_predictor.py:99-101`:
```python
# Tokenize input data on-the-fly
with torch.no_grad():
    token_seq_0, token_seq_1 = tokenizer.encode(batch_x, half=True)
```
Learning rate configuration from `finetune/config.py:57-59`:
```python
# Learning rates for different model components.
self.tokenizer_learning_rate = 2e-4
self.predictor_learning_rate = 4e-5
```
## Related Pages
- Implementation:Shiyu_coder_Kronos_SequentialTrainer_Usage
- Implementation:Shiyu_coder_Kronos_Train_Model_Tokenizer_Qlib
- Implementation:Shiyu_coder_Kronos_Train_Model_Predictor_Qlib
- Principle:Shiyu_coder_Kronos_Sequential_Two_Stage_Training
- Principle:Shiyu_coder_Kronos_Tokenizer_Finetuning
- Principle:Shiyu_coder_Kronos_Predictor_Finetuning