Implementation: shiyu-coder/Kronos QlibDataPreprocessor Usage
| Field | Value |
|---|---|
| implementation_name | QlibDataPreprocessor_Usage |
| type | API Doc |
| repository | https://github.com/shiyu-coder/Kronos |
| source_file | finetune/qlib_data_preprocess.py:L14-121 |
| implements | Principle:Shiyu_coder_Kronos_Qlib_Data_Preprocessing |
| last_updated | 2026-02-09 14:00 GMT |
Summary
The QlibDataPreprocessor class implements a three-step ETL pipeline that initializes the Qlib data provider, loads and transforms raw OHLCV data per symbol, and splits it into time-based train/val/test pickle files.
Class
QlibDataPreprocessor
API Signature
Three-step sequential usage:
QlibDataPreprocessor() -> QlibDataPreprocessor
.initialize_qlib() -> None
.load_qlib_data() -> None
.prepare_dataset() -> None
Import
from qlib_data_preprocess import QlibDataPreprocessor
Dependencies
- qlib (Microsoft Qlib framework)
- pandas
- numpy
- pickle
- tqdm
Input
- Qlib CN data directory (configured via Config.qlib_data_path, default "~/.qlib/qlib_data/cn_data")
- Instrument universe (configured via Config.instrument, default "csi300")
Output
Three pickle files saved to Config.dataset_path:
- train_data.pkl
- val_data.pkl
- test_data.pkl
Each file contains a dict[symbol -> pd.DataFrame] where the DataFrame has columns [open, high, low, close, vol, amt] with a datetime index.
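The on-disk format above can be sketched as follows. This is a minimal round-trip of the dict[symbol -> DataFrame] structure; the symbol name and values are illustrative, not taken from the repository.

```python
import io
import pickle

import pandas as pd

# Build one symbol's OHLCV frame in the documented output shape.
idx = pd.date_range("2024-01-02", periods=3, freq="D")
frame = pd.DataFrame(
    {
        "open": [10.0, 10.1, 10.2],
        "high": [10.2, 10.3, 10.4],
        "low": [9.9, 10.0, 10.1],
        "close": [10.1, 10.2, 10.3],
        "vol": [1.0e6, 1.1e6, 0.9e6],
        "amt": [1.01e7, 1.12e7, 0.93e7],
    },
    index=idx,
)
split = {"SH600000": frame}  # hypothetical symbol key

# Round-trip through pickle, as prepare_dataset() does to disk
# (an in-memory buffer stands in for a file under Config.dataset_path).
buf = io.BytesIO()
pickle.dump(split, buf)
buf.seek(0)
loaded = pickle.load(buf)

print(list(loaded["SH600000"].columns))
```

Consumers of the pickle files only need `pickle.load` and pandas; no Qlib installation is required at read time.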
Constructor
def __init__(self):
self.config = Config()
self.data_fields = ['open', 'close', 'high', 'low', 'volume', 'vwap']
self.data = {} # dict to store processed data for each symbol
The constructor creates a Config instance internally and initializes the raw Qlib field list and an empty data dictionary.
Methods
initialize_qlib()
def initialize_qlib(self) -> None
Initializes the Qlib environment by calling qlib.init() with the configured data path and REG_CN (China A-share region).
load_qlib_data()
def load_qlib_data(self) -> None
Loads raw data from Qlib and processes it symbol by symbol:
- Uses QlibDataLoader to load all fields for the instrument universe
- Adjusts the time range with buffers: start minus lookback_window, end plus predict_window
- For each symbol:
  - Pivots the table to have features as columns and datetime as index
  - Renames Qlib fields (removes the $ prefix)
  - Computes vol (alias for volume) and amt (average price times volume)
  - Selects only Config.feature_list columns
  - Drops rows with NaN values
- Filters out symbols with fewer than lookback_window + predict_window + 1 rows
- Stores valid symbols in self.data
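The per-symbol transform above can be sketched in pandas. This is an illustrative reconstruction, not the repository's code: the raw frame, symbol name, and window sizes (standing in for Config.lookback_window and Config.predict_window) are made up.

```python
import pandas as pd

lookback_window, predict_window = 2, 1  # illustrative stand-ins for Config values

# Long-format raw data with Qlib's $-prefixed field names (values invented).
raw = pd.DataFrame(
    {
        "instrument": ["SH600000"] * 4,
        "datetime": pd.date_range("2024-01-02", periods=4, freq="D"),
        "$open": [10.0, 10.1, 10.2, 10.3],
        "$high": [10.2, 10.3, 10.4, 10.5],
        "$low": [9.9, 10.0, 10.1, 10.2],
        "$close": [10.1, 10.2, 10.3, 10.4],
        "$volume": [1.0e6, 1.1e6, 0.9e6, 1.2e6],
    }
)

data = {}
for symbol, group in raw.groupby("instrument"):
    df = group.set_index("datetime").drop(columns="instrument")
    df.columns = [c.lstrip("$") for c in df.columns]  # strip the $ prefix
    df["vol"] = df["volume"]                          # alias for volume
    # amt: average of OHLC prices times volume
    df["amt"] = (df["open"] + df["high"] + df["low"] + df["close"]) / 4 * df["vol"]
    df = df[["open", "high", "low", "close", "vol", "amt"]].dropna()
    # keep only symbols with enough history for one full sample
    if len(df) >= lookback_window + predict_window + 1:
        data[symbol] = df

print(sorted(data))
```

The length filter guarantees every retained symbol can yield at least one (lookback, prediction) window pair.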
prepare_dataset()
def prepare_dataset(self) -> None
Splits the loaded data into train/val/test sets using time-based boolean masks:
- For each symbol, applies date range masks from Config.train_time_range, Config.val_time_range, and Config.test_time_range
- Creates the output directory (Config.dataset_path) if it does not exist
- Serializes each split as a pickle file
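The masking logic above can be sketched as follows. The date ranges stand in for Config.train_time_range, Config.val_time_range, and Config.test_time_range; the frame and symbol are illustrative.

```python
import pandas as pd

# Hypothetical (start, end) ranges standing in for the Config attributes.
train_range = ("2024-01-01", "2024-01-03")
val_range = ("2024-01-04", "2024-01-04")
test_range = ("2024-01-05", "2024-01-06")

idx = pd.date_range("2024-01-01", periods=6, freq="D")
data = {"SH600000": pd.DataFrame({"close": range(6)}, index=idx)}

def split_by_time(data, time_range):
    """Apply an inclusive boolean date mask to every symbol's frame."""
    start, end = time_range
    return {
        symbol: frame.loc[(frame.index >= start) & (frame.index <= end)]
        for symbol, frame in data.items()
    }

train = split_by_time(data, train_range)
val = split_by_time(data, val_range)
test = split_by_time(data, test_range)

print(len(train["SH600000"]), len(val["SH600000"]), len(test["SH600000"]))  # 3 1 2
```

Each resulting dict would then be written out with `pickle.dump` under Config.dataset_path.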
Example Usage
from qlib_data_preprocess import QlibDataPreprocessor
preprocessor = QlibDataPreprocessor()
preprocessor.initialize_qlib()
preprocessor.load_qlib_data()
preprocessor.prepare_dataset()
This can also be run directly as a script:
cd finetune/
python qlib_data_preprocess.py
Feature Derivation Details
The amt (amount) feature is computed as:
amt = (open + high + low + close) / 4 * volume
This approximates the transaction amount using the average of OHLC prices multiplied by volume.
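A worked instance of the formula, with invented prices chosen so the result is exact:

```python
# amt = mean(OHLC) * volume; all numbers here are illustrative.
open_, high, low, close, volume = 10.0, 10.5, 9.5, 10.0, 1_000_000
amt = (open_ + high + low + close) / 4 * volume
print(amt)  # 10000000.0  (mean price 10.0 * 1,000,000 shares)
```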
Source Reference
File: finetune/qlib_data_preprocess.py, lines 14-121.