Principle: Shiyu-coder Kronos Predictor Finetuning
| Field | Value |
|---|---|
| principle_name | Predictor_Finetuning |
| repository | https://github.com/shiyu-coder/Kronos |
| domains | Training, Autoregressive_Models, Distributed_Training |
| implemented_by | Implementation:Shiyu_coder_Kronos_Train_Model_Predictor_Qlib |
| last_updated | 2026-02-09 14:00 GMT |
Summary
Fine-tuning the autoregressive Transformer predictor with a cross-entropy next-token-prediction loss, using a frozen tokenizer to encode inputs on the fly.
Concept
The Predictor Finetuning principle describes the second stage of the Kronos fine-tuning pipeline, where the autoregressive Transformer model is trained to predict the next token in a sequence of discrete codes produced by the (already fine-tuned) tokenizer. This follows the standard language modeling paradigm applied to quantized time series data.
The key distinction from tokenizer fine-tuning is that here, the tokenizer is frozen (eval mode, no gradients) and used purely as an encoder to convert continuous financial data into discrete tokens on-the-fly during training.
Theory
Frozen Tokenizer Encoding
During predictor training, the fine-tuned KronosTokenizer is loaded, set to eval mode, and used within a `torch.no_grad()` context to encode each batch:
- The tokenizer's `encode(batch_x, half=True)` method produces two token sequences, `(s1, s2)`, representing stage-1 and stage-2 quantization codes respectively
- These are integer sequences representing indices into the learned codebook
- The `half=True` parameter indicates that only the encoding half of the tokenizer is used (no decoding)
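The frozen-encoding pattern above can be sketched as follows. The `ToyTokenizer` class is a hypothetical stand-in for the real KronosTokenizer (its internal quantization is invented here for illustration); only the calling pattern — eval mode plus `torch.no_grad()` around `encode(batch_x, half=True)` — mirrors the description.

```python
import torch

class ToyTokenizer(torch.nn.Module):
    """Hypothetical stand-in for the fine-tuned KronosTokenizer."""
    def __init__(self, codebook_size=16):
        super().__init__()
        self.codebook_size = codebook_size

    def encode(self, batch_x, half=True):
        # Toy quantization: bucket two feature channels into codebook indices.
        # The real tokenizer uses learned codebooks; this just shows the shapes.
        s1 = (batch_x[..., 0] * self.codebook_size).long().clamp(0, self.codebook_size - 1)
        s2 = (batch_x[..., 1] * self.codebook_size).long().clamp(0, self.codebook_size - 1)
        return s1, s2

tokenizer = ToyTokenizer().eval()       # frozen: eval mode, no dropout/batchnorm updates
batch_x = torch.rand(2, 8, 2)           # (batch, seq_len, features)
with torch.no_grad():                   # no gradients ever flow into the tokenizer
    s1, s2 = tokenizer.encode(batch_x, half=True)
```

Each of `s1` and `s2` is an integer tensor of shape `(batch, seq_len)` holding codebook indices, ready to feed the predictor.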
Next-Token Prediction
Following the standard autoregressive paradigm:
- Input tokens: `tokens[:, :-1]` (all tokens except the last)
- Target tokens: `tokens[:, 1:]` (all tokens except the first)
- The model learns to predict each token given all preceding tokens
This is applied independently to both the s1 and s2 token streams.
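The input/target shift can be shown on a single toy token stream (in Kronos this would be one of the s1 or s2 code sequences):

```python
# Toy token stream of codebook indices
tokens = [5, 12, 7, 3, 9]

inputs  = tokens[:-1]   # [5, 12, 7, 3]  - everything except the last token
targets = tokens[1:]    # [12, 7, 3, 9]  - everything except the first token

# At position i the model sees inputs[: i + 1] and must predict targets[i]
pairs = list(zip(inputs, targets))
```

The same one-position shift is applied to both token streams before they are passed to the model.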
DualHead Cross-Entropy Loss
The Kronos model has a DualHead output layer that produces logits for both s1 and s2 tokens. The loss is computed as:
`loss = DualHead.compute_loss(s1_logits, s2_logits, s1_targets, s2_targets)`
This returns the combined cross-entropy loss across both heads, along with individual s1 and s2 losses for monitoring.
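A minimal sketch of what such a dual-head loss might look like, assuming the common convention of summing per-head cross-entropies (the function name and exact reduction are assumptions, not the repository's implementation):

```python
import torch
import torch.nn.functional as F

def dual_head_loss(s1_logits, s2_logits, s1_targets, s2_targets):
    """Combined cross-entropy over both token heads (hypothetical sketch)."""
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) for cross_entropy
    l1 = F.cross_entropy(s1_logits.reshape(-1, s1_logits.size(-1)), s1_targets.reshape(-1))
    l2 = F.cross_entropy(s2_logits.reshape(-1, s2_logits.size(-1)), s2_targets.reshape(-1))
    return l1 + l2, l1, l2   # combined loss plus per-head losses for monitoring

torch.manual_seed(0)
s1_logits = torch.randn(2, 4, 16)                 # (batch, seq, codebook_size)
s2_logits = torch.randn(2, 4, 16)
s1_targets = torch.randint(0, 16, (2, 4))
s2_targets = torch.randint(0, 16, (2, 4))
total, loss_s1, loss_s2 = dual_head_loss(s1_logits, s2_logits, s1_targets, s2_targets)
```

Returning the individual `loss_s1` and `loss_s2` alongside the sum lets the training loop log whether one quantization stage is lagging behind the other.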
Optimization Strategy
- Optimizer: AdamW with betas (0.9, 0.95) and weight decay 0.1
- Learning rate:
predictor_learning_rate(default 4e-5), lower than the tokenizer learning rate since the predictor is typically larger - Scheduler: OneCycleLR with 3% warmup
- Gradient clipping: Max norm 3.0 (slightly higher than the tokenizer's 2.0)
- No gradient accumulation: Unlike tokenizer training, the predictor training loop does not use gradient accumulation steps
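Putting the listed hyperparameters together, one training step might look like this. The `Linear` model and `total_steps` value are placeholders; the optimizer settings, scheduler, and clipping norm come from the list above.

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the Kronos predictor
optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-5, betas=(0.9, 0.95), weight_decay=0.1
)
total_steps = 1000              # hypothetical: epochs * batches_per_epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=4e-5, total_steps=total_steps, pct_start=0.03  # 3% warmup
)

# One step, without gradient accumulation (each batch updates immediately):
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```

Because there is no accumulation, `optimizer.step()` and `scheduler.step()` run once per batch rather than once per N batches.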
Distributed Training
Uses the same DDP pattern as tokenizer training:
- `DistributedSampler` for data sharding across ranks
- Validation loss aggregated via `dist.all_reduce(SUM)`
- Only rank 0 performs checkpointing, logging, and summary saving
- Epoch seeding ensures reproducible sampling across ranks
- Epoch seeding ensures reproducible sampling across ranks
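The all-reduce and rank-0 gating can be sketched as below. For a self-contained demo it initializes a gloo group with `world_size=1`; real runs would launch multiple ranks via `torchrun`, and the port/address values here are arbitrary.

```python
import os
import torch
import torch.distributed as dist

# Single-process demo; multi-GPU runs get rank/world_size from torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

val_loss = torch.tensor([2.5])                     # this rank's validation loss
dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)    # sum losses across all ranks
val_loss /= dist.get_world_size()                  # mean over ranks
reduced = val_loss.item()

if dist.get_rank() == 0:
    # Only rank 0 checkpoints, logs, and saves summaries
    print(f"validation loss: {reduced:.4f}")

dist.destroy_process_group()
```

With one rank the reduction is a no-op; with N ranks every process ends up holding the same mean, but only rank 0 acts on it.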
Domains
- Training: Model fine-tuning with next-token prediction objective
- Autoregressive_Models: Transformer-based sequence prediction
- Distributed_Training: DDP-based multi-GPU training