Principle:Shiyu coder Kronos Sequential Two Stage Training
| Field | Value |
|---|---|
| Principle Name | Sequential_Two_Stage_Training |
| Repository | Shiyu_coder_Kronos |
| Repository URL | https://github.com/shiyu-coder/Kronos |
| Domains | Training, Pipeline_Orchestration, Transfer_Learning |
| Implemented By | Implementation:Shiyu_coder_Kronos_SequentialTrainer_Usage |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
This principle describes the orchestration of a two-phase finetuning pipeline for the Kronos time series model, where the tokenizer (VQ-VAE) is trained first on reconstruction loss and then the predictor (autoregressive transformer) is trained second on next-token prediction loss with the tokenizer frozen.
Concept
The Kronos architecture consists of two distinct components:
- Tokenizer (KronosTokenizer): A VQ-VAE encoder-decoder that learns to discretize continuous time series into token sequences.
- Predictor (Kronos): An autoregressive transformer that models the distribution of token sequences for forecasting.
These two components must be trained sequentially because the predictor operates in the tokenizer's discrete representation space. The tokenizer must first learn meaningful token representations before the predictor can learn to model their sequential patterns.
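The dependency between the two stages can be illustrated with a deliberately tiny stand-in, not the Kronos API: a "tokenizer" that learns a fixed mapping from continuous values to discrete codes, and a "predictor" that can only learn token-to-token statistics once that code space is frozen. All class and method names here are hypothetical.

```python
# Toy illustration of the two-stage dependency (NOT the Kronos API):
# Stage 1 fixes a discrete code space; Stage 2 models sequences in it.
from collections import Counter

class ToyTokenizer:
    """Stage 1 stand-in: learn a fixed set of discrete codes (a stand-in for a VQ codebook)."""
    def __init__(self, n_codes=4):
        self.n_codes = n_codes
        self.lo = self.hi = None

    def fit(self, series):              # analogous to "reconstruction training"
        self.lo, self.hi = min(series), max(series)

    def encode(self, series):           # continuous values -> token ids
        span = (self.hi - self.lo) or 1.0
        return [min(int((x - self.lo) / span * self.n_codes), self.n_codes - 1)
                for x in series]

class ToyPredictor:
    """Stage 2 stand-in: model next-token statistics in the frozen token space."""
    def __init__(self):
        self.bigrams = Counter()

    def fit(self, tokens):              # analogous to "next-token prediction training"
        self.bigrams.update(zip(tokens, tokens[1:]))

    def predict_next(self, token):
        candidates = [(cnt, nxt) for (cur, nxt), cnt in self.bigrams.items() if cur == token]
        return max(candidates)[1] if candidates else token

series = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
tok = ToyTokenizer()
tok.fit(series)               # Phase 1: fit the tokenizer first
tokens = tok.encode(series)   # token space is now frozen
pred = ToyPredictor()
pred.fit(tokens)              # Phase 2: train the predictor on frozen tokens
```

If the tokenizer were refit after the predictor trained, the learned bigram statistics would refer to stale codes; this is exactly the instability that sequential training avoids.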
Theory
Phase 1: Tokenizer Training (VQ-VAE Reconstruction)
- Objective: Train the tokenizer to faithfully encode and reconstruct time series windows.
- Loss: Reconstruction loss (VQ-VAE loss combining reconstruction, commitment, and codebook losses).
- Input: Raw normalized OHLCV time series windows.
- Output: Finetuned tokenizer that produces meaningful discrete token sequences for the target domain.
- Initialization: Loads pretrained tokenizer weights (or random initialization if `pre_trained_tokenizer=False`).
- Saved to: `tokenizer_best_model_path` (selected by best validation loss).
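The Phase 1 objective can be sketched with a toy vector-quantized autoencoder; this is a minimal PyTorch sketch of the loss structure (reconstruction + commitment + codebook terms with a straight-through estimator), not the KronosTokenizer implementation, and all layer sizes are illustrative.

```python
# Minimal sketch of the Phase 1 VQ-VAE objective -- NOT KronosTokenizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQTokenizer(nn.Module):
    def __init__(self, dim=8, n_codes=16):
        super().__init__()
        self.enc = nn.Linear(5, dim)                # 5 = OHLCV features per step
        self.dec = nn.Linear(dim, 5)
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, x):
        z = self.enc(x)                                           # (B, T, dim)
        d = ((z.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)  # dist to codes
        ids = d.argmin(-1)                                        # nearest code ids
        zq = self.codebook(ids)                                   # quantized latents
        zq_st = z + (zq - z).detach()                             # straight-through
        return self.dec(zq_st), z, zq

model = ToyVQTokenizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 32, 5)                # a batch of normalized OHLCV windows
for _ in range(5):
    recon, z, zq = model(x)
    loss = (F.mse_loss(recon, x)                     # reconstruction term
            + 0.25 * F.mse_loss(z, zq.detach())      # commitment term
            + F.mse_loss(zq, z.detach()))            # codebook term
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The best checkpoint by validation loss would be the one saved at the end of this phase.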
Phase 2: Predictor Training (Next-Token Prediction)
- Objective: Train the predictor to model the sequential distribution of tokens produced by the (now-frozen) tokenizer.
- Loss: Cross-entropy loss over predicted token logits vs. actual token sequences (two codebook streams: s1 and s2).
- Input: Token sequences produced by the frozen finetuned tokenizer, plus temporal stamp features.
- Output: Finetuned predictor capable of autoregressive token generation.
- Initialization: Loads the finetuned tokenizer from Phase 1 (frozen) and pretrained predictor weights (or random initialization if `pre_trained_predictor=False`).
- Saved to: `basemodel_best_model_path` (selected by best validation loss).
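The Phase 2 loss can be sketched as a next-token cross-entropy summed over the two codebook streams (s1 and s2). Shapes and tensor names below are illustrative; the real predictor's interface is not shown here, and the random tokens merely stand in for the frozen tokenizer's output.

```python
# Sketch of the Phase 2 loss: next-token cross-entropy over two token
# streams (s1, s2). Shapes are illustrative, not the Kronos interface.
import torch
import torch.nn.functional as F

B, T, V = 4, 32, 16                      # batch, sequence length, codebook size
logits_s1 = torch.randn(B, T, V)         # predictor logits for stream s1
logits_s2 = torch.randn(B, T, V)         # predictor logits for stream s2
with torch.no_grad():                    # targets come from the FROZEN tokenizer
    s1 = torch.randint(0, V, (B, T))
    s2 = torch.randint(0, V, (B, T))

# Shift by one so position t predicts the token at t+1 (next-token prediction),
# then sum the cross-entropy of the two codebook streams.
loss = (F.cross_entropy(logits_s1[:, :-1].reshape(-1, V), s1[:, 1:].reshape(-1))
        + F.cross_entropy(logits_s2[:, :-1].reshape(-1, V), s2[:, 1:].reshape(-1)))
```

Because the tokenizer is frozen, its targets are constant throughout this phase, so the predictor optimizes against a stable distribution.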
Why Sequential Training?
- Representation stability: If both were trained jointly, the predictor would need to continuously adapt to a shifting token space as the tokenizer learns. Sequential training ensures the token space is stable when the predictor trains.
- Transfer learning: Both components can start from pretrained checkpoints. The tokenizer adapts to domain-specific patterns first; then the predictor adapts to domain-specific token distributions.
- Modularity: Each phase can be run, skipped, or resumed independently via configuration flags (`train_tokenizer`, `train_basemodel`, `skip_existing`).
Pipeline Flow
```
[YAML Config] --> CustomFinetuneConfig
                        |
                        v
          SequentialTrainer.__init__()
                        |
                        v
          SequentialTrainer.run_training()
             |                        |
             v                        v
 train_tokenizer_phase()     train_basemodel_phase()
             |                        |
             v                        v
 Load pretrained tokenizer   Load finetuned tokenizer (frozen)
 Train on reconstruction     Load pretrained predictor
 Save best tokenizer         Train on cross-entropy
                             Save best predictor
```
Phase Control
The following configuration flags control execution:
- `train_tokenizer`: If False, skip Phase 1 entirely
- `train_basemodel`: If False, skip Phase 2 entirely
- `skip_existing`: If True, skip a phase if its output model already exists on disk
- `pre_trained_tokenizer`: If False, randomly initialize tokenizer instead of loading pretrained weights
- `pre_trained_predictor`: If False, randomly initialize predictor instead of loading pretrained weights
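One plausible way these flags compose is sketched below. The flag names match the list above, but the gating helper and its call sites are hypothetical, not the repository's actual orchestration code.

```python
# Hedged sketch of phase gating: the flag names are from the config above,
# but should_run_phase() and PhaseFlags are illustrative, not the real code.
import os
from dataclasses import dataclass

@dataclass
class PhaseFlags:
    train_tokenizer: bool = True
    train_basemodel: bool = True
    skip_existing: bool = False

def should_run_phase(enabled, output_path, skip_existing):
    """A phase runs only if enabled, and not if skip_existing finds its output."""
    if not enabled:
        return False                         # train_tokenizer / train_basemodel off
    if skip_existing and os.path.exists(output_path):
        return False                         # output already on disk: skip
    return True

flags = PhaseFlags(train_tokenizer=True, skip_existing=True)
run_phase1 = should_run_phase(flags.train_tokenizer,
                              "/nonexistent/tokenizer_best.pt",
                              flags.skip_existing)
```

This makes each phase independently resumable: rerunning the pipeline with `skip_existing=True` only executes phases whose best-model outputs are missing.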
Distributed Training Support
The pipeline supports optional PyTorch Distributed Data Parallel (DDP) training:
- Environment variables `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` are read at initialization
- `dist.init_process_group()` is called if `world_size > 1`
- Models are wrapped in DDP during training
- Only rank 0 saves model checkpoints
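The bootstrap described by these bullets can be sketched as follows. The environment-variable names follow the standard `torchrun` convention; the helper functions themselves are illustrative, and the backend choice (`nccl` for GPU, `gloo` for CPU) is an assumption rather than the repository's setting.

```python
# Sketch of the DDP bootstrap: read rank info from the environment and
# only initialize the process group when actually distributed.
import os
import torch
import torch.distributed as dist

def setup_distributed():
    """Read torchrun-style env vars; fall back to single-process defaults."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if world_size > 1:
        # backend="nccl" assumes GPUs; use "gloo" for CPU-only runs
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

def maybe_save(model, path, rank):
    """Only rank 0 writes checkpoints, so each file is saved exactly once."""
    if rank == 0:
        torch.save(model.state_dict(), path)

rank, world_size, local_rank = setup_distributed()  # single-process: (0, 1, 0)
```

Guarding checkpoint writes by rank avoids N processes racing to write the same file when training with N GPUs.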
See Also
- Implementation:Shiyu_coder_Kronos_SequentialTrainer_Usage -- API documentation for SequentialTrainer
- Principle:Shiyu_coder_Kronos_CSV_Finetuning_Configuration -- Configuration consumed by the training pipeline
- Principle:Shiyu_coder_Kronos_CSV_Dataset_Handling -- Data loading used during training