Workflow:Shiyu coder Kronos CSV Finetuning

Knowledge Sources	Kronos HuggingFace Models Kronos Paper
Domains	Financial_Forecasting, Fine_Tuning, LLMs
Last Updated	2026-02-09 14:00 GMT

Overview

End-to-end pipeline for finetuning the Kronos foundation model on custom CSV-formatted financial data, using YAML-based configuration and automated two-stage sequential training.

Description

This workflow provides an accessible finetuning pipeline that works with standard CSV data files rather than requiring the Qlib data infrastructure. Users prepare their own OHLCV candlestick data in CSV format, configure training through a YAML file, and run a single sequential training script that handles both stages automatically. Stage 1 finetunes the KronosTokenizer to adapt it to the target data distribution. Stage 2 finetunes the Kronos predictor model using the frozen finetuned tokenizer. The SequentialTrainer class orchestrates both stages, with options to skip stages, resume from existing checkpoints, and run on a single GPU or on multiple GPUs via DistributedDataParallel (DDP).
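
The stage selection described above can be sketched as a small planning function. This is an illustrative sketch, not the SequentialTrainer implementation: the names StageFlags and plan_stages, the boolean "checkpoint exists" inputs, and the precedence between flags are assumptions. Only the flag names train_tokenizer, train_basemodel, and skip_existing come from this workflow.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StageFlags:
    # Flag names follow the workflow's config; the dataclass itself is hypothetical.
    train_tokenizer: bool = True
    train_basemodel: bool = True
    skip_existing: bool = False


def plan_stages(flags: StageFlags, tokenizer_done: bool, basemodel_done: bool) -> list[str]:
    """Return the ordered list of stages that would run.

    tokenizer_done / basemodel_done say whether a finished checkpoint for
    that stage is already on disk (assumed to be checked by the caller).
    """
    stages = []
    run_tokenizer = flags.train_tokenizer and not (flags.skip_existing and tokenizer_done)
    if run_tokenizer:
        stages.append("tokenizer")

    run_basemodel = flags.train_basemodel and not (flags.skip_existing and basemodel_done)
    if run_basemodel:
        # Stage 2 needs a finetuned tokenizer: produced in this run or already saved.
        if not run_tokenizer and not tokenizer_done:
            raise FileNotFoundError("predictor stage requires a finetuned tokenizer checkpoint")
        stages.append("basemodel")
    return stages
```

With default flags and no checkpoints, both stages are planned in order. With skip_existing set and a tokenizer checkpoint present, only the predictor stage remains.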

Usage

Execute this workflow when you have custom financial data in CSV format (any market, any frequency) and want to specialize a pre-trained Kronos model for that data. This is the recommended path for users who do not need the Qlib data framework or who work with non-Chinese market data. The CSV must contain columns: timestamps, open, high, low, close, volume, amount (volume and amount can be zero if unavailable). The pipeline automatically splits data into train/validation sets by ratio.

Execution Steps

Step 1: Prepare CSV Data

Format your historical candlestick data as a CSV file with required columns: timestamps, open, high, low, close, volume, amount. Each row represents one time period (candle). The timestamps column must be parseable as datetime. Volume and amount can be set to zero if not available for your data source.

Key considerations:

Column order does not matter as long as the column names match
Timestamps should be sorted chronologically
The data should cover enough history that, after the pipeline splits it by ratio, both the training and validation sets still contain many more rows than lookback_window + predict_window
Any financial instrument and frequency (1min, 5min, daily, etc.) can be used
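
A pre-flight check along these lines can catch format problems before training starts. It is a minimal sketch using only the standard library, not part of the pipeline: the function name check_candles is hypothetical, timestamps are assumed to be ISO-8601 strings (the pipeline's own parser may accept more formats), and the high/low sanity check is an extra assumption that the workflow does not require.

```python
import csv
import io
from datetime import datetime

REQUIRED_COLUMNS = ["timestamps", "open", "high", "low", "close", "volume", "amount"]


def check_candles(csv_text: str) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    # Columns are looked up by name, so their order in the file does not matter.
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    previous = None
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            ts = datetime.fromisoformat(row["timestamps"])
        except ValueError:
            problems.append(f"line {line_no}: unparseable timestamp {row['timestamps']!r}")
            continue
        if previous is not None and ts < previous:
            problems.append(f"line {line_no}: timestamp out of order")
        previous = ts

        o, h, l, c = (float(row[k]) for k in ("open", "high", "low", "close"))
        # Extra sanity check (an assumption, not a pipeline requirement).
        if h < max(o, c) or l > min(o, c):
            problems.append(f"line {line_no}: high/low inconsistent with open/close")
    return problems
```

Passing the contents of a file, for example check_candles(open(path).read()), returns an empty list when the required columns are present and the rows are in chronological order.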

Step 2: Configure YAML

Create or edit a YAML configuration file specifying data paths, training hyperparameters, model paths, and experiment settings. The configuration is organized into sections: data (path, window sizes, split ratios), training (epochs, batch size, learning rates), model_paths (pretrained model locations, save paths), experiment (flags for which stages to run), and device settings.

Key considerations:

Set data_path to your CSV file location
Set pretrained_tokenizer and pretrained_predictor to local paths or HuggingFace Hub identifiers
Configure lookback_window and predict_window to match your forecasting horizon
The config_loader validates all settings and auto-generates derived paths from exp_name
Set train_tokenizer and train_basemodel flags to control which stages run
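
The configuration layout can be pictured as a nested mapping, shown here as the Python dict that a YAML loader would produce. This is a sketch under stated assumptions: the section names and the keys data_path, lookback_window, predict_window, pretrained_tokenizer, pretrained_predictor, base_save_path, exp_name, train_tokenizer, train_basemodel, and skip_existing come from this workflow, while every other key name, the placement of keys within sections, and all values are placeholders. The validate and derive_paths helpers are hypothetical stand-ins for what config_loader is described as doing.

```python
from pathlib import PurePosixPath

config = {
    "data": {
        "data_path": "data/my_candles.csv",  # placeholder path
        "lookback_window": 512,              # placeholder value
        "predict_window": 48,                # placeholder value
        "train_ratio": 0.9,                  # assumed key name
    },
    "training": {
        "tokenizer_epochs": 10,              # assumed key name
        "basemodel_epochs": 10,              # assumed key name
        "batch_size": 32,                    # assumed key name
        "tokenizer_learning_rate": 2e-4,     # default stated in Step 3
        "predictor_learning_rate": 1e-6,     # default stated in Step 4
        "accumulation_steps": 1,             # assumed key name
    },
    "model_paths": {
        "pretrained_tokenizer": "pretrained/tokenizer",  # local path or Hub id
        "pretrained_predictor": "pretrained/predictor",  # local path or Hub id
        "base_save_path": "outputs",
        "exp_name": "demo_exp",
    },
    "experiment": {"train_tokenizer": True, "train_basemodel": True, "skip_existing": False},
    "device": {"use_cuda": True},            # assumed key name
}


def validate(cfg: dict) -> None:
    """Fail fast on obviously wrong settings (an illustrative subset of checks)."""
    data = cfg["data"]
    if data["lookback_window"] <= 0 or data["predict_window"] <= 0:
        raise ValueError("window sizes must be positive")
    if not 0.0 < data["train_ratio"] < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")


def derive_paths(cfg: dict) -> dict:
    """Build the save locations {base_save_path}/{exp_name}/<stage>/best_model."""
    paths = cfg["model_paths"]
    root = PurePosixPath(paths["base_save_path"]) / paths["exp_name"]
    return {
        "tokenizer_best": str(root / "tokenizer" / "best_model"),
        "basemodel_best": str(root / "basemodel" / "best_model"),
    }
```

Deriving the save paths from exp_name in one place keeps the two stages consistent: Stage 2 can locate the Stage 1 output without a separately configured path.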

Step 3: Train Tokenizer

The SequentialTrainer loads the pre-trained KronosTokenizer and finetunes it on your CSV data. The CSV is loaded, split into train/validation sets by the configured ratio, and windowed into overlapping samples. Training minimizes a reconstruction loss (MSE between the input and the reconstructed OHLCV values) plus the BSQ (binary spherical quantization) loss. The best tokenizer checkpoint is selected by validation loss and saved.
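
The split-then-window preparation can be sketched in a few lines. This is an illustration under assumptions rather than the pipeline's dataset class: the function names are hypothetical, the split is assumed to be chronological (earlier rows train, later rows validate), each sample is assumed to span lookback_window + predict_window consecutive rows, and the stride of 1 is a placeholder.

```python
def split_by_ratio(rows, train_ratio):
    """Chronological split: the first train_ratio share of rows trains, the rest validates."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


def sliding_windows(rows, lookback_window, predict_window, stride=1):
    """Yield overlapping samples of lookback_window + predict_window consecutive rows."""
    size = lookback_window + predict_window
    for start in range(0, len(rows) - size + 1, stride):
        yield rows[start:start + size]
```

Splitting before windowing keeps every validation window strictly later in time than every training window, so no candle is shared between the two sets.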

Key considerations:

This stage adapts the tokenizer's quantization codebook to your data distribution
Default learning rate is 2e-4 with AdamW optimizer
Gradient accumulation is configurable for limited GPU memory
The tokenizer can be initialized from pretrained weights or from scratch (using architecture config only)
Skip this stage with the --skip-tokenizer flag if you already have a finetuned tokenizer
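
Gradient accumulation, mentioned above, trades memory for time: several small micro-batches contribute to one optimizer step. The toy below shows the idea on a one-parameter model with plain gradient descent. It is a conceptual sketch only: the pipeline is stated to use AdamW, and the function names here are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_step(w, micro_batches, lr):
    """One optimizer step built from several equally sized micro-batches."""
    accum = 0.0
    for micro_batch in micro_batches:
        # Scaling by the number of micro-batches makes the accumulated
        # gradient equal to the mean gradient over all samples.
        accum += grad_mse(w, micro_batch) / len(micro_batches)
    return w - lr * accum
```

When the micro-batches have equal size, the accumulated update matches a single step on the combined batch, which is why accumulation can stand in for a larger batch size on a GPU with limited memory.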

Step 4: Train Predictor

The SequentialTrainer loads the finetuned tokenizer (from Step 3) and the pre-trained Kronos predictor, then finetunes the predictor on token sequences. The frozen tokenizer encodes training windows into s1/s2 token pairs, and the predictor learns next-token prediction via cross-entropy loss on both token heads. The best predictor checkpoint is saved based on validation loss.

Key considerations:

Requires the finetuned tokenizer from Step 3 to exist
The tokenizer is frozen during predictor training (no gradient updates)
Default learning rate is 1e-6, well below the tokenizer stage's default of 2e-4
The predictor can also be initialized from scratch using architecture config
Skip this stage with the --skip-basemodel flag if you only want to finetune the tokenizer
Supports DDP multi-GPU training via torchrun for larger datasets
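
The training objective of this stage, next-token prediction with cross-entropy on both token heads, can be written out in plain Python. This is a sketch for intuition, not the training code: the function names are hypothetical, the equal weighting of the s1 and s2 losses is an assumption, and a real run would compute this on batched tensors.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def two_head_loss(s1_logits, s2_logits, s1_tokens, s2_tokens):
    """Average next-token loss over both heads.

    s1_logits[t] and s2_logits[t] are the scores produced at position t for
    the token at position t + 1, so the targets are the sequences shifted by one.
    """
    steps = len(s1_tokens) - 1
    loss_s1 = sum(cross_entropy(s1_logits[t], s1_tokens[t + 1]) for t in range(steps)) / steps
    loss_s2 = sum(cross_entropy(s2_logits[t], s2_tokens[t + 1]) for t in range(steps)) / steps
    return (loss_s1 + loss_s2) / 2  # equal weighting is an assumption
```

A model that scores every token equally gets a loss of log(vocabulary size) per head, which is a useful baseline when reading the validation loss during training.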

Step 5: Validate Results

After training completes, load the finetuned tokenizer and predictor using the standard KronosPredictor interface and run predictions on held-out data. Compare predicted candlesticks against actual values to assess finetuning quality. The finetuned models are saved in HuggingFace-compatible format and can be loaded with from_pretrained.

Key considerations:

Models are saved to {base_save_path}/{exp_name}/tokenizer/best_model/ and {base_save_path}/{exp_name}/basemodel/best_model/
Use the standard KronosPredictor.predict() method for inference with finetuned models
Training logs are saved to {base_save_path}/logs/ for debugging
The skip_existing flag allows resuming without retraining completed stages
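
Comparing predicted candlesticks against actual values can start with two simple metrics on the close price. The helpers below are an illustrative sketch and not part of the pipeline: the function names are hypothetical, and the predicted values are assumed to have been extracted beforehand from the output of KronosPredictor.predict() on held-out data.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def directional_accuracy(last_close, predicted, actual):
    """Share of steps where prediction and reality move the same way.

    Each step is compared against the previous actual close; last_close is
    the final close of the lookback window.
    """
    previous = [last_close] + list(actual[:-1])
    hits = sum((p - q) * (a - q) > 0 for p, a, q in zip(predicted, actual, previous))
    return hits / len(actual)
```

Running the same metrics with the pre-trained (not finetuned) models on the same held-out window gives a baseline for judging whether finetuning helped.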

GitHub URL

Workflow Repository