
Heuristic:Shiyu coder Kronos Two-Stage Finetuning Strategy

From Leeroopedia




Knowledge Sources
Domains: Training, Optimization
Last Updated: 2026-02-09 13:47 GMT

Overview

Sequential two-stage finetuning strategy: train the VQ-VAE tokenizer first, then finetune the autoregressive predictor using the frozen finetuned tokenizer.

Description

Kronos employs a strict two-phase finetuning pipeline. In Phase 1, the VQ-VAE tokenizer is finetuned on the target-domain data to adapt its codebook to the new data distribution. In Phase 2, the autoregressive predictor is finetuned while the finetuned tokenizer, now frozen (non-trainable), encodes inputs into tokens on the fly. This ordering is critical: the predictor must learn to predict tokens produced by the adapted tokenizer, not the pretrained one. The tokenizer is always set to `eval()` mode during predictor training.
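
The ordering can be sketched in plain Python. The classes and method names below are illustrative stand-ins, not the actual Kronos API; they only encode the contract described above (Phase 1 before Phase 2, tokenizer frozen and in eval mode during predictor training):

```python
# Toy sketch of the two-phase ordering (hypothetical classes, not Kronos code).

class ToyTokenizer:
    def __init__(self):
        self.trained = False
        self.training_mode = True  # analogous to torch's train()/eval() flag

    def finetune(self, data):
        self.trained = True  # Phase 1: adapt the codebook to target data

    def eval(self):
        self.training_mode = False  # frozen for Phase 2

    def encode(self, batch):
        # Stand-in for discrete token ids from the adapted codebook.
        assert self.trained, "predictor must see tokens from the adapted tokenizer"
        return [hash(x) % 16 for x in batch]


class ToyPredictor:
    def __init__(self):
        self.seen_tokens = []

    def finetune(self, tokenizer, data):
        assert not tokenizer.training_mode, "tokenizer must be in eval mode"
        self.seen_tokens = tokenizer.encode(data)  # Phase 2: on-the-fly encoding


def run_pipeline(data):
    tok, pred = ToyTokenizer(), ToyPredictor()
    tok.finetune(data)  # Phase 1 first
    tok.eval()          # freeze before Phase 2
    pred.finetune(tok, data)
    return tok, pred
```

Swapping the two `finetune` calls trips the assertions, which is exactly the failure mode the strict ordering prevents.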

Usage

Use this heuristic when:

  • Finetuning Kronos on new data: Always train tokenizer first, then predictor
  • Debugging poor finetuning results: Verify tokenizer quality before blaming the predictor
  • Deciding whether to skip tokenizer finetuning: If your data is similar to the pretrained distribution, you may skip Phase 1 (configurable via `train_tokenizer` flag)
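
The skip options can be expressed as a small config check. The flag names `train_tokenizer` and `train_basemodel` match the source; the `FinetuneConfig` class and `phases_to_run` helper are a hypothetical sketch of the driver logic:

```python
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    # Flag names mirror the Kronos finetuning config; defaults are illustrative.
    train_tokenizer: bool = True
    train_basemodel: bool = True


def phases_to_run(config: FinetuneConfig) -> list:
    """Return the phases the sequential driver would execute, in order."""
    phases = []
    if config.train_tokenizer:
        phases.append("tokenizer")  # Phase 1
    if config.train_basemodel:
        phases.append("predictor")  # Phase 2
    return phases
```

For data close to the pretrained distribution, `FinetuneConfig(train_tokenizer=False)` yields only the predictor phase, matching the skip path described above.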

The Insight (Rule of Thumb)

  • Action: Always finetune the tokenizer (Phase 1) before the predictor (Phase 2). Never finetune them jointly.
  • Value: Phase 1 uses `tokenizer_learning_rate=2e-4`; Phase 2 uses `predictor_learning_rate=4e-5` (5x lower).
  • Trade-off: Sequential training takes longer than joint training but prevents the predictor from compensating for a poorly adapted tokenizer. The tokenizer learns the data distribution first (unsupervised reconstruction), then the predictor builds task-specific patterns on top.
  • Skip option: Both phases can be independently enabled/disabled via `train_tokenizer` and `train_basemodel` config flags.

Reasoning

The two-stage strategy mirrors the model's architecture: the tokenizer converts continuous OHLCV data into discrete tokens, and the predictor models token sequences autoregressively. If both were trained jointly, gradient signals from the predictor would interfere with the tokenizer's codebook learning, leading to unstable training. By freezing the tokenizer during Phase 2, the predictor receives a stable tokenization and can focus on learning temporal dependencies.
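
The effect of the freeze can be simulated without torch. In the real code the freeze comes from `tokenizer.eval()` plus wrapping `tokenizer.encode` in `torch.no_grad()`; the function below is a toy stand-in that shows the invariant — a Phase 2 update leaves tokenizer parameters untouched while predictor parameters move:

```python
def predictor_step(tokenizer_params, predictor_params, grad=0.25):
    """One hypothetical Phase-2 update (toy sketch, not Kronos code).

    Tokenizer parameters pass through unchanged, mimicking the
    torch.no_grad() wrapper around tokenizer.encode(...); only the
    predictor parameters receive a gradient update.
    """
    frozen = dict(tokenizer_params)  # no gradient flows into the tokenizer
    updated = {k: v - grad for k, v in predictor_params.items()}
    return frozen, updated
```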

The 5x higher learning rate for the tokenizer (2e-4 vs 4e-5) reflects that the tokenizer needs more aggressive adaptation to learn the codebook distribution, while the predictor benefits from conservative updates that preserve pretrained sequence modeling capabilities.
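
The 5x ratio follows directly from the config values. The class below is a reduced sketch reproducing only the two learning-rate fields from `finetune/config.py`:

```python
class Config:
    def __init__(self):
        # Learning rates as set in the Kronos finetuning config.
        self.tokenizer_learning_rate = 2e-4  # Phase 1: aggressive codebook adaptation
        self.predictor_learning_rate = 4e-5  # Phase 2: conservative, preserves pretraining


config = Config()
ratio = config.tokenizer_learning_rate / config.predictor_learning_rate  # ~5.0
```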

Evidence from `finetune_csv/train_sequential.py:264-306`:

if self.config.train_tokenizer:
    success = self.train_tokenizer_phase()  # Phase 1
else:
    print("Skipping Tokenizer training phase")

if self.config.train_basemodel:
    success = self.train_basemodel_phase()   # Phase 2
else:
    print("Skipping Basemodel training phase")

Frozen tokenizer usage during predictor training from `finetune/train_predictor.py:99-101`:

# Tokenize input data on-the-fly
with torch.no_grad():
    token_seq_0, token_seq_1 = tokenizer.encode(batch_x, half=True)

Learning rate configuration from `finetune/config.py:57-59`:

# Learning rates for different model components.
self.tokenizer_learning_rate = 2e-4
self.predictor_learning_rate = 4e-5

Related Pages
