Workflow:Shiyu coder Kronos CSV Finetuning
| Knowledge Sources | |
|---|---|
| Domains | Financial_Forecasting, Fine_Tuning, LLMs |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
End-to-end pipeline for finetuning the Kronos foundation model on custom CSV-formatted financial data, using YAML-based configuration and automated two-stage sequential training.
Description
This workflow provides an accessible finetuning pipeline that works with standard CSV data files rather than requiring the Qlib data infrastructure. Users prepare their own OHLCV candlestick data in CSV format, configure training through a YAML file, and run a single sequential training script that handles both stages automatically. Stage 1 finetunes the KronosTokenizer to adapt it to the target data distribution. Stage 2 finetunes the Kronos predictor model on top of the finetuned tokenizer, which is kept frozen. The SequentialTrainer class orchestrates both stages, with options to skip a stage, resume from existing checkpoints, and run on a single GPU or on multiple GPUs via PyTorch DistributedDataParallel (DDP).
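The stage-selection behavior described above can be illustrated with a minimal sketch. This is not the SequentialTrainer code: the function `plan_stages`, its arguments, and the action strings are hypothetical, and only the flag names `train_tokenizer`, `train_basemodel`, and `skip_existing` come from this page.

```python
def plan_stages(train_tokenizer, train_basemodel, skip_existing, existing):
    """Decide what to do with each stage, in execution order.

    `existing` is a set of stage names that already have a saved checkpoint
    (a hypothetical input; the real trainer checks the filesystem itself).
    """
    plan = []
    for name, enabled in (("tokenizer", train_tokenizer), ("basemodel", train_basemodel)):
        if not enabled:
            action = "skip (disabled)"
        elif skip_existing and name in existing:
            action = "skip (checkpoint exists)"
        else:
            action = "train"
        plan.append((name, action))
    return plan


# Resume a run whose tokenizer stage already finished.
plan = plan_stages(True, True, skip_existing=True, existing={"tokenizer"})
for stage, action in plan:
    print(f"{stage}: {action}")
```

The tokenizer always comes first in the plan because the predictor stage depends on its output.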
Usage
Execute this workflow when you have custom financial data in CSV format (any market, any frequency) and want to specialize a pre-trained Kronos model for that data. This is the recommended path for users who do not need the Qlib data framework or who work with non-Chinese market data. The CSV must contain the columns timestamps, open, high, low, close, volume, and amount (volume and amount can be zero if unavailable). The pipeline automatically splits the data into train and validation sets according to a configured ratio.
Execution Steps
Step 1: Prepare CSV Data
Format your historical candlestick data as a CSV file with required columns: timestamps, open, high, low, close, volume, amount. Each row represents one time period (candle). The timestamps column must be parseable as datetime. Volume and amount can be set to zero if not available for your data source.
Key considerations:
- Column order does not matter as long as the column names match
- Timestamps should be sorted chronologically
- The data should cover enough history for meaningful training: the pipeline splits it by ratio, so the training and validation portions must each be long enough to yield sample windows
- Any financial instrument and frequency (1min, 5min, daily, etc.) can be used
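The requirements above can be checked before training with a short standard-library script. This is a minimal sketch, not part of the pipeline: the function `check_candles` is hypothetical, and it assumes ISO-style timestamps such as `2024-01-02 09:30:00` (the pipeline itself may accept other datetime formats).

```python
import csv
import io
from datetime import datetime

REQUIRED_COLUMNS = ["timestamps", "open", "high", "low", "close", "volume", "amount"]


def check_candles(csv_file):
    """Return a list of problems found in an OHLCV CSV; an empty list means OK."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    previous = None
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        try:
            # Assumption: ISO-style timestamps.
            current = datetime.fromisoformat(row["timestamps"])
        except ValueError:
            problems.append(f"row {row_number}: unparseable timestamp {row['timestamps']!r}")
            continue
        if previous is not None and current < previous:
            problems.append(f"row {row_number}: timestamp out of order")
        previous = current
    return problems


# Columns appear in a different order here to show that only the names matter.
sample = io.StringIO(
    "close,open,high,low,volume,amount,timestamps\n"
    "10.5,10.0,10.8,9.9,0,0,2024-01-02 09:30:00\n"
    "10.7,10.5,10.9,10.4,0,0,2024-01-02 09:35:00\n"
)
problems = check_candles(sample)
print(problems)
```

For a real file, pass an open handle instead of the in-memory sample, for example `open(path, newline="")`.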
Step 2: Configure YAML
Create or edit a YAML configuration file specifying data paths, training hyperparameters, model paths, and experiment settings. The configuration is organized into sections: data (path, window sizes, split ratios), training (epochs, batch size, learning rates), model_paths (pretrained model locations, save paths), experiment (flags for which stages to run), and device settings.
Key considerations:
- Set data_path to your CSV file location
- Set pretrained_tokenizer and pretrained_predictor to local paths or HuggingFace Hub identifiers
- Configure lookback_window and predict_window to match your forecasting horizon
- The config_loader validates all settings and auto-generates derived paths from exp_name
- Set train_tokenizer and train_basemodel flags to control which stages run
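To make the configuration layout concrete, the sketch below mirrors the YAML sections as a Python dictionary and shows the kind of validation and path derivation described above. It is an illustration, not the real config_loader: the helpers `validate_config` and `derive_save_paths`, the keys `train_ratio` and `val_ratio`, the placement of `exp_name` and `base_save_path` under `model_paths`, and all values are assumptions.

```python
from pathlib import PurePosixPath

# Python mirror of the YAML sections named above. Keys marked "hypothetical"
# are illustrative names, not confirmed keys of the real config file.
config = {
    "data": {
        "data_path": "data/my_candles.csv",  # placeholder CSV location
        "lookback_window": 512,              # illustrative value
        "predict_window": 48,                # illustrative value
        "train_ratio": 0.9,                  # hypothetical key
        "val_ratio": 0.1,                    # hypothetical key
    },
    "model_paths": {
        "pretrained_tokenizer": "path/or/hub-id/of/tokenizer",  # placeholder
        "pretrained_predictor": "path/or/hub-id/of/predictor",  # placeholder
        "base_save_path": "finetuned",   # section placement is an assumption
        "exp_name": "my_experiment",     # section placement is an assumption
    },
    "experiment": {"train_tokenizer": True, "train_basemodel": True},
}


def validate_config(cfg):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    data = cfg["data"]
    if not data.get("data_path"):
        problems.append("data_path is empty")
    for key in ("lookback_window", "predict_window"):
        value = data.get(key)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    if data["train_ratio"] + data["val_ratio"] > 1.0 + 1e-9:
        problems.append("train_ratio + val_ratio exceeds 1")
    return problems


def derive_save_paths(cfg):
    """Build per-stage checkpoint directories from base_save_path and exp_name."""
    paths = cfg["model_paths"]
    root = PurePosixPath(paths["base_save_path"]) / paths["exp_name"]
    return {stage: str(root / stage / "best_model") for stage in ("tokenizer", "basemodel")}


problems = validate_config(config)
save_paths = derive_save_paths(config)
print(problems)
print(save_paths)
```

Deriving the save locations from one experiment name keeps the two stages consistent: the predictor stage can find the tokenizer checkpoint without a second path setting.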
Step 3: Train Tokenizer
The SequentialTrainer loads the pre-trained KronosTokenizer and finetunes it on your CSV data. The CSV is loaded, split into train and validation sets by the configured ratio, and windowed into overlapping samples. Training minimizes a reconstruction loss (the MSE between the input and the reconstructed OHLCV values) plus the binary spherical quantization (BSQ) loss. The tokenizer checkpoint with the lowest validation loss is saved as the best model.
Key considerations:
- This stage adapts the tokenizer's quantization codebook to your data distribution
- The default learning rate is 2e-4, used with the AdamW optimizer
- Gradient accumulation is configurable for limited GPU memory
- The tokenizer can be initialized from pretrained weights or from scratch (using architecture config only)
- Skip this stage with the --skip-tokenizer flag if you already have a finetuned tokenizer
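The split-and-window step described above can be sketched in a few lines. This is a simplified illustration, not the pipeline's dataset class: `split_by_ratio` and `sliding_windows` are hypothetical helpers, integers stand in for candle rows, and the window size and stride are illustrative.

```python
def split_by_ratio(rows, train_ratio):
    """Chronological split: the first part trains, the remainder validates."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


def sliding_windows(rows, window_size, stride=1):
    """Return overlapping windows of `window_size` consecutive rows."""
    last_start = len(rows) - window_size
    return [rows[i:i + window_size] for i in range(0, last_start + 1, stride)]


# Ten rows stand in for ten candles, already sorted by timestamp.
rows = list(range(10))
train, val = split_by_ratio(rows, 0.8)
windows = sliding_windows(train, window_size=4)

print(train, val)
print(len(windows), windows[0], windows[-1])
```

Splitting before windowing keeps every validation window strictly later in time than the training windows, which avoids leaking future candles into training.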
Step 4: Train Predictor
The SequentialTrainer loads the finetuned tokenizer (from Step 3) and the pre-trained Kronos predictor, then finetunes the predictor on token sequences. The frozen tokenizer encodes training windows into s1/s2 token pairs, and the predictor learns next-token prediction via cross-entropy loss on both token heads. The best predictor checkpoint is saved based on validation loss.
Key considerations:
- Requires the finetuned tokenizer from Step 3 to exist
- The tokenizer is frozen during predictor training (no gradient updates)
- The default learning rate is 1e-6, 200 times lower than the tokenizer's default of 2e-4
- The predictor can also be initialized from scratch using architecture config
- Skip this stage with the --skip-basemodel flag if you only want to finetune the tokenizer
- Supports DDP multi-GPU training via torchrun for larger datasets
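The next-token objective on the two token heads can be written out in plain Python to show the target shift. This is a sketch under stated assumptions, not the Kronos training code: the function names are hypothetical, the logits are toy values, and averaging the s1 and s2 losses (rather than, say, summing them) is an assumption.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def next_token_loss(s1_logits, s2_logits, s1_tokens, s2_tokens):
    """Mean loss over positions; the logits at position t predict token t + 1."""
    steps = len(s1_tokens) - 1
    total = 0.0
    for t in range(steps):
        ce_s1 = cross_entropy(s1_logits[t], s1_tokens[t + 1])
        ce_s2 = cross_entropy(s2_logits[t], s2_tokens[t + 1])
        total += (ce_s1 + ce_s2) / 2  # assumption: the two heads are averaged
    return total / steps


# Toy sequence of three (s1, s2) token pairs with a vocabulary of four tokens.
s1_tokens = [0, 1, 2]
s2_tokens = [3, 2, 1]
uniform = [[0.0] * 4 for _ in range(2)]  # an untrained model: all tokens equally likely
loss = next_token_loss(uniform, uniform, s1_tokens, s2_tokens)
print(loss)
```

With uniform logits the loss equals the natural log of the vocabulary size, a useful sanity value to compare against at the start of training.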
Step 5: Validate Results
After training completes, load the finetuned tokenizer and predictor using the standard KronosPredictor interface and run predictions on held-out data. Compare predicted candlesticks against actual values to assess finetuning quality. The finetuned models are saved in HuggingFace-compatible format and can be loaded with from_pretrained.
Key considerations:
- Models are saved to {base_save_path}/{exp_name}/tokenizer/best_model/ and {base_save_path}/{exp_name}/basemodel/best_model/
- Use the standard KronosPredictor.predict() method for inference with finetuned models
- Training logs are saved to {base_save_path}/logs/ for debugging
- The skip_existing flag allows resuming without retraining completed stages
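Once predictions for the held-out period are available, comparing them against the actual candles can be as simple as a per-column mean absolute error. The sketch below is one possible check, not a metric prescribed by the workflow: `per_column_mae` is a hypothetical helper, and the candle values are made up for illustration.

```python
PRICE_COLUMNS = ("open", "high", "low", "close")


def per_column_mae(predicted, actual):
    """Mean absolute error per price column over aligned candle rows."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and the same length")
    return {
        column: sum(abs(p[column] - a[column]) for p, a in zip(predicted, actual)) / len(actual)
        for column in PRICE_COLUMNS
    }


# Made-up candles; in practice these come from the model output and the held-out CSV rows.
predicted = [
    {"open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5},
    {"open": 10.5, "high": 11.5, "low": 10.0, "close": 11.0},
]
actual = [
    {"open": 10.0, "high": 11.5, "low": 9.5, "close": 10.0},
    {"open": 10.0, "high": 11.5, "low": 10.5, "close": 11.5},
]
errors = per_column_mae(predicted, actual)
print(errors)
```

Running the same comparison with the pre-trained (not finetuned) model on the same held-out rows gives a baseline for judging whether finetuning helped.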