Principle: Shiyu-coder Kronos Predictor Finetuning
| Field | Value |
|---|---|
| principle_name | Predictor_Finetuning |
| repository | https://github.com/shiyu-coder/Kronos |
| domains | Training, Autoregressive_Models, Distributed_Training |
| implemented_by | Implementation:Shiyu_coder_Kronos_Train_Model_Predictor_Qlib |
| last_updated | 2026-02-09 14:00 GMT |
Summary
Fine-tuning the autoregressive Transformer predictor with a cross-entropy next-token-prediction loss, using a frozen tokenizer to encode inputs on the fly.
Concept
The Predictor Finetuning principle describes the second stage of the Kronos fine-tuning pipeline, where the autoregressive Transformer model is trained to predict the next token in a sequence of discrete codes produced by the (already fine-tuned) tokenizer. This follows the standard language modeling paradigm applied to quantized time series data.
The key distinction from tokenizer fine-tuning is that here, the tokenizer is frozen (eval mode, no gradients) and used purely as an encoder to convert continuous financial data into discrete tokens on-the-fly during training.
Theory
Frozen Tokenizer Encoding
During predictor training, the fine-tuned KronosTokenizer is loaded, set to eval mode, and used within a `torch.no_grad()` context to encode each batch:
- The tokenizer's `encode(batch_x, half=True)` method produces two token sequences, `(s1, s2)`, representing stage-1 and stage-2 quantization codes respectively
- These are integer sequences representing indices into the learned codebook
- The `half=True` parameter indicates that only the encoding half of the tokenizer is used (no decoding)
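The frozen-encoding pattern above can be sketched as follows. The `ToyTokenizer` class is a hypothetical stand-in for the real KronosTokenizer (its internal quantization is invented here for illustration); only the calling pattern — eval mode plus `torch.no_grad()` around `encode(batch_x, half=True)` — mirrors the description.

```python
import torch

class ToyTokenizer(torch.nn.Module):
    """Hypothetical stand-in for the fine-tuned KronosTokenizer."""
    def __init__(self, codebook_size=16):
        super().__init__()
        self.codebook_size = codebook_size

    def encode(self, batch_x, half=True):
        # Toy quantization: bucket two feature channels into codebook indices.
        # The real tokenizer uses learned codebooks; this just shows the shapes.
        s1 = (batch_x[..., 0] * self.codebook_size).long().clamp(0, self.codebook_size - 1)
        s2 = (batch_x[..., 1] * self.codebook_size).long().clamp(0, self.codebook_size - 1)
        return s1, s2

tokenizer = ToyTokenizer().eval()       # frozen: eval mode, no dropout/batchnorm updates
batch_x = torch.rand(2, 8, 2)           # (batch, seq_len, features)
with torch.no_grad():                   # no gradients ever flow into the tokenizer
    s1, s2 = tokenizer.encode(batch_x, half=True)
```

Each of `s1` and `s2` is an integer tensor of shape `(batch, seq_len)` holding codebook indices, ready to feed the predictor.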
Next-Token Prediction
Following the standard autoregressive paradigm:
- Input tokens: `tokens[:, :-1]` (all tokens except the last)
- Target tokens: `tokens[:, 1:]` (all tokens except the first)
- The model learns to predict each token given all preceding tokens
This is applied independently to both the s1 and s2 token streams.
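The input/target shift can be shown on a single toy token stream (in Kronos this would be one of the s1 or s2 code sequences):

```python
# Toy token stream of codebook indices
tokens = [5, 12, 7, 3, 9]

inputs  = tokens[:-1]   # [5, 12, 7, 3]  - everything except the last token
targets = tokens[1:]    # [12, 7, 3, 9]  - everything except the first token

# At position i the model sees inputs[: i + 1] and must predict targets[i]
pairs = list(zip(inputs, targets))
```

The same one-position shift is applied to both token streams before they are passed to the model.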
DualHead Cross-Entropy Loss
The Kronos model has a DualHead output layer that produces logits for both s1 and s2 tokens. The loss is computed as:
`loss = DualHead.compute_loss(s1_logits, s2_logits, s1_targets, s2_targets)`
This returns the combined cross-entropy loss across both heads, along with individual s1 and s2 losses for monitoring.
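A minimal sketch of what such a dual-head loss might look like, assuming the common convention of summing per-head cross-entropies (the function name and exact reduction are assumptions, not the repository's implementation):

```python
import torch
import torch.nn.functional as F

def dual_head_loss(s1_logits, s2_logits, s1_targets, s2_targets):
    """Combined cross-entropy over both token heads (hypothetical sketch)."""
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) for cross_entropy
    l1 = F.cross_entropy(s1_logits.reshape(-1, s1_logits.size(-1)), s1_targets.reshape(-1))
    l2 = F.cross_entropy(s2_logits.reshape(-1, s2_logits.size(-1)), s2_targets.reshape(-1))
    return l1 + l2, l1, l2   # combined loss plus per-head losses for monitoring

torch.manual_seed(0)
s1_logits = torch.randn(2, 4, 16)                 # (batch, seq, codebook_size)
s2_logits = torch.randn(2, 4, 16)
s1_targets = torch.randint(0, 16, (2, 4))
s2_targets = torch.randint(0, 16, (2, 4))
total, loss_s1, loss_s2 = dual_head_loss(s1_logits, s2_logits, s1_targets, s2_targets)
```

Returning the individual `loss_s1` and `loss_s2` alongside the sum lets the training loop log whether one quantization stage is lagging behind the other.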
Optimization Strategy
- Optimizer: AdamW with betas (0.9, 0.95) and weight decay 0.1
- Learning rate:
predictor_learning_rate(default 4e-5), lower than the tokenizer learning rate since the predictor is typically larger - Scheduler: OneCycleLR with 3% warmup
- Gradient clipping: Max norm 3.0 (slightly higher than the tokenizer's 2.0)
- No gradient accumulation: Unlike tokenizer training, the predictor training loop does not use gradient accumulation steps
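Putting the listed hyperparameters together, one training step might look like this. The `Linear` model and `total_steps` value are placeholders; the optimizer settings, scheduler, and clipping norm come from the list above.

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the Kronos predictor
optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-5, betas=(0.9, 0.95), weight_decay=0.1
)
total_steps = 1000              # hypothetical: epochs * batches_per_epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=4e-5, total_steps=total_steps, pct_start=0.03  # 3% warmup
)

# One step, without gradient accumulation (each batch updates immediately):
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```

Because there is no accumulation, `optimizer.step()` and `scheduler.step()` run once per batch rather than once per N batches.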
Distributed Training
Uses the same DDP pattern as tokenizer training:
- `DistributedSampler` for data sharding across ranks
- Validation loss aggregated via `dist.all_reduce(SUM)`
- Only rank 0 performs checkpointing, logging, and summary saving
- Epoch seeding ensures reproducible sampling across ranks
- Epoch seeding ensures reproducible sampling across ranks
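The all-reduce and rank-0 gating can be sketched as below. For a self-contained demo it initializes a gloo group with `world_size=1`; real runs would launch multiple ranks via `torchrun`, and the port/address values here are arbitrary.

```python
import os
import torch
import torch.distributed as dist

# Single-process demo; multi-GPU runs get rank/world_size from torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

val_loss = torch.tensor([2.5])                     # this rank's validation loss
dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)    # sum losses across all ranks
val_loss /= dist.get_world_size()                  # mean over ranks
reduced = val_loss.item()

if dist.get_rank() == 0:
    # Only rank 0 checkpoints, logs, and saves summaries
    print(f"validation loss: {reduced:.4f}")

dist.destroy_process_group()
```

With one rank the reduction is a no-op; with N ranks every process ends up holding the same mean, but only rank 0 acts on it.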
Domains
- Training: Model fine-tuning with next-token prediction objective
- Autoregressive_Models: Transformer-based sequence prediction
- Distributed_Training: DDP-based multi-GPU training