Principle:Shiyu coder Kronos Sequential Two Stage Training
| Field | Value |
|---|---|
| Principle Name | Sequential_Two_Stage_Training |
| Repository | Shiyu_coder_Kronos |
| Repository URL | https://github.com/shiyu-coder/Kronos |
| Domains | Training, Pipeline_Orchestration, Transfer_Learning |
| Implemented By | Implementation:Shiyu_coder_Kronos_SequentialTrainer_Usage |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
This principle describes the orchestration of a two-phase finetuning pipeline for the Kronos time series model, where the tokenizer (VQ-VAE) is trained first on reconstruction loss and then the predictor (autoregressive transformer) is trained second on next-token prediction loss with the tokenizer frozen.
Concept
The Kronos architecture consists of two distinct components:
- Tokenizer (KronosTokenizer): A VQ-VAE encoder-decoder that learns to discretize continuous time series into token sequences.
- Predictor (Kronos): An autoregressive transformer that models the distribution of token sequences for forecasting.
These two components must be trained sequentially because the predictor operates in the tokenizer's discrete representation space. The tokenizer must first learn meaningful token representations before the predictor can learn to model their sequential patterns.
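The dependency between the two stages can be illustrated with a deliberately tiny stand-in, not the Kronos API: a "tokenizer" that learns a fixed mapping from continuous values to discrete codes, and a "predictor" that can only learn token-to-token statistics once that code space is frozen. All class and method names here are hypothetical.

```python
# Toy illustration of the two-stage dependency (NOT the Kronos API):
# Stage 1 fixes a discrete code space; Stage 2 models sequences in it.
from collections import Counter

class ToyTokenizer:
    """Stage 1 stand-in: learn a fixed set of discrete codes (a stand-in for a VQ codebook)."""
    def __init__(self, n_codes=4):
        self.n_codes = n_codes
        self.lo = self.hi = None

    def fit(self, series):              # analogous to "reconstruction training"
        self.lo, self.hi = min(series), max(series)

    def encode(self, series):           # continuous values -> token ids
        span = (self.hi - self.lo) or 1.0
        return [min(int((x - self.lo) / span * self.n_codes), self.n_codes - 1)
                for x in series]

class ToyPredictor:
    """Stage 2 stand-in: model next-token statistics in the frozen token space."""
    def __init__(self):
        self.bigrams = Counter()

    def fit(self, tokens):              # analogous to "next-token prediction training"
        self.bigrams.update(zip(tokens, tokens[1:]))

    def predict_next(self, token):
        candidates = [(cnt, nxt) for (cur, nxt), cnt in self.bigrams.items() if cur == token]
        return max(candidates)[1] if candidates else token

series = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
tok = ToyTokenizer()
tok.fit(series)               # Phase 1: fit the tokenizer first
tokens = tok.encode(series)   # token space is now frozen
pred = ToyPredictor()
pred.fit(tokens)              # Phase 2: train the predictor on frozen tokens
```

If the tokenizer were refit after the predictor trained, the learned bigram statistics would refer to stale codes; this is exactly the instability that sequential training avoids.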
Theory
Phase 1: Tokenizer Training (VQ-VAE Reconstruction)
- Objective: Train the tokenizer to faithfully encode and reconstruct time series windows.
- Loss: Reconstruction loss (VQ-VAE loss combining reconstruction, commitment, and codebook losses).
- Input: Raw normalized OHLCV time series windows.
- Output: Finetuned tokenizer that produces meaningful discrete token sequences for the target domain.
- Initialization: Loads pretrained tokenizer weights (or random initialization if `pre_trained_tokenizer=False`).
- Saved to: `tokenizer_best_model_path` (selected by best validation loss).
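The Phase 1 objective can be sketched with a toy vector-quantized autoencoder; this is a minimal PyTorch sketch of the loss structure (reconstruction + commitment + codebook terms with a straight-through estimator), not the KronosTokenizer implementation, and all layer sizes are illustrative.

```python
# Minimal sketch of the Phase 1 VQ-VAE objective -- NOT KronosTokenizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQTokenizer(nn.Module):
    def __init__(self, dim=8, n_codes=16):
        super().__init__()
        self.enc = nn.Linear(5, dim)                # 5 = OHLCV features per step
        self.dec = nn.Linear(dim, 5)
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, x):
        z = self.enc(x)                                           # (B, T, dim)
        d = ((z.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)  # dist to codes
        ids = d.argmin(-1)                                        # nearest code ids
        zq = self.codebook(ids)                                   # quantized latents
        zq_st = z + (zq - z).detach()                             # straight-through
        return self.dec(zq_st), z, zq

model = ToyVQTokenizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 32, 5)                # a batch of normalized OHLCV windows
for _ in range(5):
    recon, z, zq = model(x)
    loss = (F.mse_loss(recon, x)                     # reconstruction term
            + 0.25 * F.mse_loss(z, zq.detach())      # commitment term
            + F.mse_loss(zq, z.detach()))            # codebook term
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The best checkpoint by validation loss would be the one saved at the end of this phase.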
Phase 2: Predictor Training (Next-Token Prediction)
- Objective: Train the predictor to model the sequential distribution of tokens produced by the (now-frozen) tokenizer.
- Loss: Cross-entropy loss over predicted token logits vs. actual token sequences (two codebook streams: s1 and s2).
- Input: Token sequences produced by the frozen finetuned tokenizer, plus temporal stamp features.
- Output: Finetuned predictor capable of autoregressive token generation.
- Initialization: Loads the finetuned tokenizer from Phase 1 (frozen) and pretrained predictor weights (or random initialization if `pre_trained_predictor=False`).
- Saved to: `basemodel_best_model_path` (selected by best validation loss).
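The Phase 2 loss can be sketched as a next-token cross-entropy summed over the two codebook streams (s1 and s2). Shapes and tensor names below are illustrative; the real predictor's interface is not shown here, and the random tokens merely stand in for the frozen tokenizer's output.

```python
# Sketch of the Phase 2 loss: next-token cross-entropy over two token
# streams (s1, s2). Shapes are illustrative, not the Kronos interface.
import torch
import torch.nn.functional as F

B, T, V = 4, 32, 16                      # batch, sequence length, codebook size
logits_s1 = torch.randn(B, T, V)         # predictor logits for stream s1
logits_s2 = torch.randn(B, T, V)         # predictor logits for stream s2
with torch.no_grad():                    # targets come from the FROZEN tokenizer
    s1 = torch.randint(0, V, (B, T))
    s2 = torch.randint(0, V, (B, T))

# Shift by one so position t predicts the token at t+1 (next-token prediction),
# then sum the cross-entropy of the two codebook streams.
loss = (F.cross_entropy(logits_s1[:, :-1].reshape(-1, V), s1[:, 1:].reshape(-1))
        + F.cross_entropy(logits_s2[:, :-1].reshape(-1, V), s2[:, 1:].reshape(-1)))
```

Because the tokenizer is frozen, its targets are constant throughout this phase, so the predictor optimizes against a stable distribution.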
Why Sequential Training?
- Representation stability: If both were trained jointly, the predictor would need to continuously adapt to a shifting token space as the tokenizer learns. Sequential training ensures the token space is stable when the predictor trains.
- Transfer learning: Both components can start from pretrained checkpoints. The tokenizer adapts to domain-specific patterns first; then the predictor adapts to domain-specific token distributions.
- Modularity: Each phase can be run, skipped, or resumed independently via configuration flags (`train_tokenizer`, `train_basemodel`, `skip_existing`).
Pipeline Flow
```
[YAML Config] --> CustomFinetuneConfig
                        |
                        v
          SequentialTrainer.__init__()
                        |
                        v
          SequentialTrainer.run_training()
             |                        |
             v                        v
 train_tokenizer_phase()     train_basemodel_phase()
             |                        |
             v                        v
 Load pretrained tokenizer   Load finetuned tokenizer (frozen)
 Train on reconstruction     Load pretrained predictor
 Save best tokenizer         Train on cross-entropy
                             Save best predictor
```
Phase Control
The following configuration flags control execution:
- `train_tokenizer`: If False, skip Phase 1 entirely
- `train_basemodel`: If False, skip Phase 2 entirely
- `skip_existing`: If True, skip a phase if its output model already exists on disk
- `pre_trained_tokenizer`: If False, randomly initialize tokenizer instead of loading pretrained weights
- `pre_trained_predictor`: If False, randomly initialize predictor instead of loading pretrained weights
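One plausible way these flags compose is sketched below. The flag names match the list above, but the gating helper and its call sites are hypothetical, not the repository's actual orchestration code.

```python
# Hedged sketch of phase gating: the flag names are from the config above,
# but should_run_phase() and PhaseFlags are illustrative, not the real code.
import os
from dataclasses import dataclass

@dataclass
class PhaseFlags:
    train_tokenizer: bool = True
    train_basemodel: bool = True
    skip_existing: bool = False

def should_run_phase(enabled, output_path, skip_existing):
    """A phase runs only if enabled, and not if skip_existing finds its output."""
    if not enabled:
        return False                         # train_tokenizer / train_basemodel off
    if skip_existing and os.path.exists(output_path):
        return False                         # output already on disk: skip
    return True

flags = PhaseFlags(train_tokenizer=True, skip_existing=True)
run_phase1 = should_run_phase(flags.train_tokenizer,
                              "/nonexistent/tokenizer_best.pt",
                              flags.skip_existing)
```

This makes each phase independently resumable: rerunning the pipeline with `skip_existing=True` only executes phases whose best-model outputs are missing.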
Distributed Training Support
The pipeline supports optional PyTorch Distributed Data Parallel (DDP) training:
- Environment variables `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` are read at initialization
- `dist.init_process_group()` is called if `world_size > 1`
- Models are wrapped in DDP during training
- Only rank 0 saves model checkpoints
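The bootstrap described by these bullets can be sketched as follows. The environment-variable names follow the standard `torchrun` convention; the helper functions themselves are illustrative, and the backend choice (`nccl` for GPU, `gloo` for CPU) is an assumption rather than the repository's setting.

```python
# Sketch of the DDP bootstrap: read rank info from the environment and
# only initialize the process group when actually distributed.
import os
import torch
import torch.distributed as dist

def setup_distributed():
    """Read torchrun-style env vars; fall back to single-process defaults."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if world_size > 1:
        # backend="nccl" assumes GPUs; use "gloo" for CPU-only runs
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

def maybe_save(model, path, rank):
    """Only rank 0 writes checkpoints, so each file is saved exactly once."""
    if rank == 0:
        torch.save(model.state_dict(), path)

rank, world_size, local_rank = setup_distributed()  # single-process: (0, 1, 0)
```

Guarding checkpoint writes by rank avoids N processes racing to write the same file when training with N GPUs.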
See Also
- Implementation:Shiyu_coder_Kronos_SequentialTrainer_Usage -- API documentation for SequentialTrainer
- Principle:Shiyu_coder_Kronos_CSV_Finetuning_Configuration -- Configuration consumed by the training pipeline
- Principle:Shiyu_coder_Kronos_CSV_Dataset_Handling -- Data loading used during training