
Principle:Shiyu coder Kronos Sequential Two Stage Training

From Leeroopedia


Principle Name: Sequential_Two_Stage_Training
Repository: Shiyu_coder_Kronos
Repository URL: https://github.com/shiyu-coder/Kronos
Domains: Training, Pipeline_Orchestration, Transfer_Learning
Implemented By: Implementation:Shiyu_coder_Kronos_SequentialTrainer_Usage
Last Updated: 2026-02-09 14:00 GMT

Overview

This principle describes the orchestration of a two-phase finetuning pipeline for the Kronos time series model: the tokenizer (a VQ-VAE) is first trained with a reconstruction loss, and the predictor (an autoregressive transformer) is then trained with a next-token prediction loss while the tokenizer remains frozen.

Concept

The Kronos architecture consists of two distinct components:

  • Tokenizer (KronosTokenizer): A VQ-VAE encoder-decoder that learns to discretize continuous time series into token sequences.
  • Predictor (Kronos): An autoregressive transformer that models the distribution of token sequences for forecasting.

These two components must be trained sequentially because the predictor operates in the tokenizer's discrete representation space. The tokenizer must first learn meaningful token representations before the predictor can learn to model their sequential patterns.
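The tokenizer's role can be illustrated with a toy sketch (hypothetical, not the Kronos API): a quantizer maps each continuous value in a window to the index of its nearest codebook entry, turning the series into the discrete token sequence the predictor later models.

```python
def quantize(window, codebook):
    """Map each continuous value to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))
            for x in window]

codebook = [-1.0, 0.0, 1.0]         # toy 3-entry codebook
window = [0.9, -0.2, 0.1, -1.1]     # normalized series window
tokens = quantize(window, codebook)  # -> [2, 1, 1, 0]
```

A real VQ-VAE quantizes learned encoder vectors rather than raw values, but the nearest-entry lookup is the same idea.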

Theory

Phase 1: Tokenizer Training (VQ-VAE Reconstruction)

  • Objective: Train the tokenizer to faithfully encode and reconstruct time series windows.
  • Loss: Reconstruction loss (VQ-VAE loss combining reconstruction, commitment, and codebook losses).
  • Input: Raw normalized OHLCV time series windows.
  • Output: Finetuned tokenizer that produces meaningful discrete token sequences for the target domain.
  • Initialization: Loads pretrained tokenizer weights (or random initialization if pre_trained_tokenizer=False).
  • Saved to: tokenizer_best_model_path (selected by best validation loss).
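The Phase 1 objective can be sketched with toy scalars (illustrative, not the Kronos code). The standard VQ-VAE loss combines reconstruction, codebook, and commitment terms, L = ||x - x_hat||^2 + ||sg(z_e) - e||^2 + beta * ||z_e - sg(e)||^2, where sg() is stop-gradient; with plain floats the stop-gradient is a no-op.

```python
def vqvae_loss(x, x_hat, z_e, e, beta=0.25):
    recon = (x - x_hat) ** 2        # reconstruction error on the decoded window
    codebook = (z_e - e) ** 2       # pulls codebook entry e toward encoding z_e
    commit = beta * (z_e - e) ** 2  # keeps encoder output close to its code
    return recon + codebook + commit

loss = vqvae_loss(x=1.0, x_hat=0.9, z_e=0.5, e=0.4, beta=0.25)  # approx. 0.0225
```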

Phase 2: Predictor Training (Next-Token Prediction)

  • Objective: Train the predictor to model the sequential distribution of tokens produced by the (now-frozen) tokenizer.
  • Loss: Cross-entropy loss over predicted token logits vs. actual token sequences (two codebook streams: s1 and s2).
  • Input: Token sequences produced by the frozen finetuned tokenizer, plus temporal stamp features.
  • Output: Finetuned predictor capable of autoregressive token generation.
  • Initialization: Loads the finetuned tokenizer from Phase 1 (frozen) and pretrained predictor weights (or random initialization if pre_trained_predictor=False).
  • Saved to: basemodel_best_model_path (selected by best validation loss).
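The Phase 2 objective can be sketched as follows (illustrative, not the Kronos code): cross-entropy is computed per codebook stream and the two stream losses are summed.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under predicted probabilities."""
    return -math.log(probs[target])

def two_stream_loss(s1_probs, s1_target, s2_probs, s2_target):
    # One cross-entropy term per codebook stream (s1 and s2), summed.
    return cross_entropy(s1_probs, s1_target) + cross_entropy(s2_probs, s2_target)

loss = two_stream_loss([0.7, 0.2, 0.1], 0, [0.1, 0.8, 0.1], 1)
```

In the actual pipeline the probabilities come from the predictor's logits over each codebook, and the targets are the token indices emitted by the frozen tokenizer.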

Why Sequential Training?

  • Representation stability: If both were trained jointly, the predictor would need to continuously adapt to a shifting token space as the tokenizer learns. Sequential training ensures the token space is stable when the predictor trains.
  • Transfer learning: Both components can start from pretrained checkpoints. The tokenizer adapts to domain-specific patterns first; then the predictor adapts to domain-specific token distributions.
  • Modularity: Each phase can be run, skipped, or resumed independently via configuration flags (train_tokenizer, train_basemodel, skip_existing).

Pipeline Flow

[YAML Config] --> CustomFinetuneConfig
                        |
                        v
              SequentialTrainer.__init__()
                        |
                        v
              SequentialTrainer.run_training()
                  |                    |
                  v                    v
    train_tokenizer_phase()   train_basemodel_phase()
          |                          |
          v                          v
    Load pretrained tokenizer   Load finetuned tokenizer (frozen)
    Train on reconstruction     Load pretrained predictor
    Save best tokenizer         Train on cross-entropy
                                Save best predictor
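The flow above can be sketched as a minimal orchestrator (method names follow the diagram; the bodies are placeholder assumptions, not the actual Kronos implementation):

```python
class SequentialTrainer:
    def __init__(self, config):
        self.config = config

    def run_training(self):
        # Run each phase only if its flag is enabled; Phase 1 before Phase 2.
        outputs = []
        if self.config.get("train_tokenizer", True):
            outputs.append(self.train_tokenizer_phase())
        if self.config.get("train_basemodel", True):
            outputs.append(self.train_basemodel_phase())
        return outputs

    def train_tokenizer_phase(self):
        # Phase 1: load pretrained tokenizer, train on reconstruction, save best.
        return "tokenizer_best_model_path"

    def train_basemodel_phase(self):
        # Phase 2: load frozen finetuned tokenizer and pretrained predictor,
        # train on cross-entropy, save best predictor.
        return "basemodel_best_model_path"

trainer = SequentialTrainer({"train_tokenizer": True, "train_basemodel": True})
paths = trainer.run_training()
```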

Phase Control

The following configuration flags control execution:

  • train_tokenizer: If False, skip Phase 1 entirely
  • train_basemodel: If False, skip Phase 2 entirely
  • skip_existing: If True, skip a phase if its output model already exists on disk
  • pre_trained_tokenizer: If False, randomly initialize tokenizer instead of loading pretrained weights
  • pre_trained_predictor: If False, randomly initialize predictor instead of loading pretrained weights
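The interaction between the phase flags and skip_existing can be sketched as a single predicate (the exact check in Kronos may differ):

```python
import os

def should_run_phase(enabled, skip_existing, output_path):
    """A phase runs only if its flag is enabled, and is skipped when
    skip_existing is set and its output model already exists on disk."""
    if not enabled:
        return False
    if skip_existing and os.path.exists(output_path):
        return False
    return True

# Phase enabled and no checkpoint on disk yet -> the phase runs.
should_run_phase(True, True, "/nonexistent/tokenizer_best.pt")  # -> True
```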

Distributed Training Support

The pipeline supports optional PyTorch Distributed Data Parallel (DDP) training:

  • Environment variables RANK, WORLD_SIZE, and LOCAL_RANK are read at initialization
  • dist.init_process_group() is called if world_size > 1
  • Models are wrapped in DDP during training
  • Only rank 0 saves model checkpoints
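The DDP bootstrap described above can be sketched as follows; the actual pipeline goes on to call torch.distributed.init_process_group and wrap models in DDP, which is elided here so the sketch stays dependency-free:

```python
import os

def ddp_context(env=os.environ):
    # Read the standard torchrun environment variables, with single-process defaults.
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", 0))
    return {
        "rank": rank,
        "world_size": world_size,
        "local_rank": local_rank,
        "use_ddp": world_size > 1,  # init_process_group only when multi-process
        "is_saver": rank == 0,      # only rank 0 writes checkpoints
    }

ctx = ddp_context({"RANK": "1", "WORLD_SIZE": "4", "LOCAL_RANK": "1"})
# ctx["use_ddp"] is True; ctx["is_saver"] is False
```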

See Also

Heuristic Links
