
Principle:Shiyu coder Kronos Predictor Finetuning

From Leeroopedia


Field           Value
principle_name  Predictor_Finetuning
repository      https://github.com/shiyu-coder/Kronos
domains         Training, Autoregressive_Models, Distributed_Training
implemented_by  Implementation:Shiyu_coder_Kronos_Train_Model_Predictor_Qlib
last_updated    2026-02-09 14:00 GMT

Summary

Fine-tuning the autoregressive Transformer predictor with a cross-entropy next-token prediction objective, using the frozen tokenizer to encode inputs into discrete codes on-the-fly.

Concept

The Predictor Finetuning principle describes the second stage of the Kronos fine-tuning pipeline, where the autoregressive Transformer model is trained to predict the next token in a sequence of discrete codes produced by the (already fine-tuned) tokenizer. This follows the standard language modeling paradigm applied to quantized time series data.

The key distinction from tokenizer fine-tuning is that here, the tokenizer is frozen (eval mode, no gradients) and used purely as an encoder to convert continuous financial data into discrete tokens on-the-fly during training.

Theory

Frozen Tokenizer Encoding

During predictor training, the fine-tuned KronosTokenizer is loaded, set to eval mode, and used within a torch.no_grad() context to encode each batch:

  • The tokenizer's encode(batch_x, half=True) method produces two token sequences: (s1, s2), representing stage-1 and stage-2 quantization codes respectively
  • These are integer sequences representing indices in the learned codebook
  • The half=True parameter indicates that only the encoding half of the tokenizer is used (no decoding)
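A minimal sketch of this frozen-encoder step, assuming a KronosTokenizer class exposing the encode(batch_x, half=True) method described above; the import path, loading call, and variable names are illustrative, not the repository's exact API:

import torch
from model import KronosTokenizer  # assumed import path

tokenizer = KronosTokenizer.from_pretrained("finetuned_tokenizer_dir")  # assumed loader
tokenizer.eval()                        # eval mode: no dropout, no running-stat updates
for p in tokenizer.parameters():
    p.requires_grad_(False)             # make the freeze explicit

def encode_batch(batch_x: torch.Tensor):
    # Convert a continuous feature window into discrete (s1, s2) codebook indices.
    with torch.no_grad():               # tokenizer is a fixed encoder; no gradients flow
        s1, s2 = tokenizer.encode(batch_x, half=True)
    return s1, s2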

Next-Token Prediction

Following the standard autoregressive paradigm:

  • Input tokens: tokens[:, :-1] (all tokens except the last)
  • Target tokens: tokens[:, 1:] (all tokens except the first)
  • The model learns to predict each token given all preceding tokens

This is applied independently to both the s1 and s2 token streams.
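As a concrete illustration of the shift, a minimal sketch on a single token stream (batch size, sequence length, and vocabulary size are placeholders; the same slicing applies to s2):

import torch

s1 = torch.randint(0, 1024, (8, 512))   # (batch, seq_len) codebook indices from the tokenizer
s1_inputs  = s1[:, :-1]                 # the model sees every token except the last
s1_targets = s1[:, 1:]                  # and must predict each token's successor
# s1_logits, s2_logits = model(...)     # targets feed the cross-entropy loss described below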

DualHead Cross-Entropy Loss

The Kronos model has a DualHead output layer that produces logits for both s1 and s2 tokens. The loss is computed as:

loss = DualHead.compute_loss(s1_logits, s2_logits, s1_targets, s2_targets)

This returns the combined cross-entropy loss across both heads, along with individual s1 and s2 losses for monitoring.
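A hedged sketch of what such a dual-head loss could look like; the actual DualHead.compute_loss in the repository may reshape or weight the terms differently:

import torch.nn.functional as F

def compute_loss(s1_logits, s2_logits, s1_targets, s2_targets):
    # logits: (batch, seq, vocab); targets: (batch, seq) integer codebook ids
    s1_loss = F.cross_entropy(s1_logits.reshape(-1, s1_logits.size(-1)), s1_targets.reshape(-1))
    s2_loss = F.cross_entropy(s2_logits.reshape(-1, s2_logits.size(-1)), s2_targets.reshape(-1))
    loss = s1_loss + s2_loss            # combined objective used for backpropagation
    return loss, s1_loss, s2_loss       # individual terms kept for monitoring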

Optimization Strategy

  • Optimizer: AdamW with betas (0.9, 0.95) and weight decay 0.1
  • Learning rate: predictor_learning_rate (default 4e-5), lower than the tokenizer learning rate since the predictor is typically larger
  • Scheduler: OneCycleLR with 3% warmup
  • Gradient clipping: Max norm 3.0 (slightly higher than the tokenizer's 2.0)
  • No gradient accumulation: Unlike tokenizer training, the predictor training loop does not use gradient accumulation steps
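Putting the settings above together, a sketch of the optimizer, scheduler, and clipping setup; model, train_loader, epochs, and train_step are placeholders rather than the repository's identifiers:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5, betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=4e-5,
    total_steps=epochs * len(train_loader),
    pct_start=0.03)                                    # 3% warmup

for batch_x, _ in train_loader:
    loss, s1_loss, s2_loss = train_step(batch_x)       # hypothetical per-batch forward + loss
    optimizer.zero_grad()
    loss.backward()                                    # one optimizer step per batch: no accumulation
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)
    optimizer.step()
    scheduler.step()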

Distributed Training

Uses the same DDP pattern as tokenizer training:

  • DistributedSampler for data sharding across ranks
  • Validation loss aggregated via dist.all_reduce(SUM)
  • Only rank 0 performs checkpointing, logging, and summary saving
  • Epoch seeding ensures reproducible sampling across ranks
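A sketch of these DDP conventions, assuming the process group is already initialized; dataset, model, and helper names are illustrative:

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                           # epoch seeding: reproducible shards per rank
    train_one_epoch(model, loader)                     # hypothetical training helper

    val_loss = validate(model, val_loader)             # local validation loss as a tensor
    dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)    # sum across ranks
    val_loss /= dist.get_world_size()                  # average for logging

    if dist.get_rank() == 0:                           # only rank 0 checkpoints, logs, saves summaries
        torch.save(model.state_dict(), f"predictor_epoch{epoch}.pt")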

Domains

  • Training: Model fine-tuning with next-token prediction objective
  • Autoregressive_Models: Transformer-based sequence prediction
  • Distributed_Training: DDP-based multi-GPU training

See Also

Heuristic Links
