Implementation: shiyu-coder/Kronos QlibDataPreprocessor Usage
| Field | Value |
|---|---|
| implementation_name | QlibDataPreprocessor_Usage |
| type | API Doc |
| repository | https://github.com/shiyu-coder/Kronos |
| source_file | finetune/qlib_data_preprocess.py:L14-121 |
| implements | Principle:Shiyu_coder_Kronos_Qlib_Data_Preprocessing |
| last_updated | 2026-02-09 14:00 GMT |
Summary
The QlibDataPreprocessor class implements a three-step ETL pipeline that initializes the Qlib data provider, loads and transforms raw OHLCV data per symbol, and splits it into time-based train/val/test pickle files.
Class
QlibDataPreprocessor
API Signature
Three-step sequential usage:
QlibDataPreprocessor() -> QlibDataPreprocessor
.initialize_qlib() -> None
.load_qlib_data() -> None
.prepare_dataset() -> None
Import
from qlib_data_preprocess import QlibDataPreprocessor
Dependencies
- qlib (Microsoft Qlib framework)
- pandas
- numpy
- pickle
- tqdm
Input
- Qlib CN data directory (configured via Config.qlib_data_path, default "~/.qlib/qlib_data/cn_data")
- Instrument universe (configured via Config.instrument, default "csi300")
Output
Three pickle files saved to Config.dataset_path:
- train_data.pkl
- val_data.pkl
- test_data.pkl
Each file contains a dict[symbol -> pd.DataFrame] where the DataFrame has columns [open, high, low, close, vol, amt] with a datetime index.
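The on-disk format above can be sketched as follows. This is a minimal round-trip of the dict[symbol -> DataFrame] structure; the symbol name and values are illustrative, not taken from the repository.

```python
import io
import pickle

import pandas as pd

# Build one symbol's OHLCV frame in the documented output shape.
idx = pd.date_range("2024-01-02", periods=3, freq="D")
frame = pd.DataFrame(
    {
        "open": [10.0, 10.1, 10.2],
        "high": [10.2, 10.3, 10.4],
        "low": [9.9, 10.0, 10.1],
        "close": [10.1, 10.2, 10.3],
        "vol": [1.0e6, 1.1e6, 0.9e6],
        "amt": [1.01e7, 1.12e7, 0.93e7],
    },
    index=idx,
)
split = {"SH600000": frame}  # hypothetical symbol key

# Round-trip through pickle, as prepare_dataset() does to disk
# (an in-memory buffer stands in for a file under Config.dataset_path).
buf = io.BytesIO()
pickle.dump(split, buf)
buf.seek(0)
loaded = pickle.load(buf)

print(list(loaded["SH600000"].columns))
```

Consumers of the pickle files only need `pickle.load` and pandas; no Qlib installation is required at read time.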
Constructor
def __init__(self):
self.config = Config()
self.data_fields = ['open', 'close', 'high', 'low', 'volume', 'vwap']
self.data = {} # dict to store processed data for each symbol
The constructor creates a Config instance internally and initializes the raw Qlib field list and an empty data dictionary.
Methods
initialize_qlib()
def initialize_qlib(self) -> None
Initializes the Qlib environment by calling qlib.init() with the configured data path and REG_CN (China A-share region).
load_qlib_data()
def load_qlib_data(self) -> None
Loads raw data from Qlib and processes it symbol by symbol:
- Uses QlibDataLoader to load all fields for the instrument universe
- Adjusts the time range with buffers: start minus lookback_window, end plus predict_window
- For each symbol:
  - Pivots the table to have features as columns and datetime as index
  - Renames Qlib fields (removes the $ prefix)
  - Computes vol (alias for volume) and amt (average price times volume)
  - Selects only Config.feature_list columns
  - Drops rows with NaN values
- Filters out symbols with fewer than lookback_window + predict_window + 1 rows
- Stores valid symbols in self.data
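The per-symbol transform above can be sketched in pandas. This is an illustrative reconstruction, not the repository's code: the raw frame, symbol name, and window sizes (standing in for Config.lookback_window and Config.predict_window) are made up.

```python
import pandas as pd

lookback_window, predict_window = 2, 1  # illustrative stand-ins for Config values

# Long-format raw data with Qlib's $-prefixed field names (values invented).
raw = pd.DataFrame(
    {
        "instrument": ["SH600000"] * 4,
        "datetime": pd.date_range("2024-01-02", periods=4, freq="D"),
        "$open": [10.0, 10.1, 10.2, 10.3],
        "$high": [10.2, 10.3, 10.4, 10.5],
        "$low": [9.9, 10.0, 10.1, 10.2],
        "$close": [10.1, 10.2, 10.3, 10.4],
        "$volume": [1.0e6, 1.1e6, 0.9e6, 1.2e6],
    }
)

data = {}
for symbol, group in raw.groupby("instrument"):
    df = group.set_index("datetime").drop(columns="instrument")
    df.columns = [c.lstrip("$") for c in df.columns]  # strip the $ prefix
    df["vol"] = df["volume"]                          # alias for volume
    # amt: average of OHLC prices times volume
    df["amt"] = (df["open"] + df["high"] + df["low"] + df["close"]) / 4 * df["vol"]
    df = df[["open", "high", "low", "close", "vol", "amt"]].dropna()
    # keep only symbols with enough history for one full sample
    if len(df) >= lookback_window + predict_window + 1:
        data[symbol] = df

print(sorted(data))
```

The length filter guarantees every retained symbol can yield at least one (lookback, prediction) window pair.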
prepare_dataset()
def prepare_dataset(self) -> None
Splits the loaded data into train/val/test sets using time-based boolean masks:
- For each symbol, applies date range masks from Config.train_time_range, Config.val_time_range, and Config.test_time_range
- Creates the output directory (Config.dataset_path) if it does not exist
- Serializes each split as a pickle file
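The masking logic above can be sketched as follows. The date ranges stand in for Config.train_time_range, Config.val_time_range, and Config.test_time_range; the frame and symbol are illustrative.

```python
import pandas as pd

# Hypothetical (start, end) ranges standing in for the Config attributes.
train_range = ("2024-01-01", "2024-01-03")
val_range = ("2024-01-04", "2024-01-04")
test_range = ("2024-01-05", "2024-01-06")

idx = pd.date_range("2024-01-01", periods=6, freq="D")
data = {"SH600000": pd.DataFrame({"close": range(6)}, index=idx)}

def split_by_time(data, time_range):
    """Apply an inclusive boolean date mask to every symbol's frame."""
    start, end = time_range
    return {
        symbol: frame.loc[(frame.index >= start) & (frame.index <= end)]
        for symbol, frame in data.items()
    }

train = split_by_time(data, train_range)
val = split_by_time(data, val_range)
test = split_by_time(data, test_range)

print(len(train["SH600000"]), len(val["SH600000"]), len(test["SH600000"]))  # 3 1 2
```

Each resulting dict would then be written out with `pickle.dump` under Config.dataset_path.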
Example Usage
from qlib_data_preprocess import QlibDataPreprocessor
preprocessor = QlibDataPreprocessor()
preprocessor.initialize_qlib()
preprocessor.load_qlib_data()
preprocessor.prepare_dataset()
This can also be run directly as a script:
cd finetune/
python qlib_data_preprocess.py
Feature Derivation Details
The amt (amount) feature is computed as:
amt = (open + high + low + close) / 4 * volume
This approximates the transaction amount using the average of OHLC prices multiplied by volume.
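A worked instance of the formula, with invented prices chosen so the result is exact:

```python
# amt = mean(OHLC) * volume; all numbers here are illustrative.
open_, high, low, close, volume = 10.0, 10.5, 9.5, 10.0, 1_000_000
amt = (open_ + high + low + close) / 4 * volume
print(amt)  # 10000000.0  (mean price 10.0 * 1,000,000 shares)
```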
Source Reference
File: finetune/qlib_data_preprocess.py, lines 14-121.