
Implementation:Shiyu coder Kronos QlibDataPreprocessor Usage

From Leeroopedia


Field Value
implementation_name QlibDataPreprocessor_Usage
type API Doc
repository https://github.com/shiyu-coder/Kronos
source_file finetune/qlib_data_preprocess.py:L14-121
implements Principle:Shiyu_coder_Kronos_Qlib_Data_Preprocessing
last_updated 2026-02-09 14:00 GMT

Summary

The QlibDataPreprocessor class implements a three-step ETL pipeline that initializes the Qlib data provider, loads and transforms raw OHLCV data per symbol, and splits it into time-based train/val/test pickle files.

Class

QlibDataPreprocessor

API Signature

Three-step sequential usage:

QlibDataPreprocessor() -> QlibDataPreprocessor
.initialize_qlib() -> None
.load_qlib_data() -> None
.prepare_dataset() -> None

Import

from qlib_data_preprocess import QlibDataPreprocessor

Dependencies

  • qlib (Microsoft Qlib framework)
  • pandas
  • numpy
  • pickle
  • tqdm

Input

  • Qlib CN data directory (configured via Config.qlib_data_path, default "~/.qlib/qlib_data/cn_data")
  • Instrument universe (configured via Config.instrument, default "csi300")

Output

Three pickle files saved to Config.dataset_path:

  • train_data.pkl
  • val_data.pkl
  • test_data.pkl

Each file contains a dict[symbol -> pd.DataFrame] where the DataFrame has columns [open, high, low, close, vol, amt] with a datetime index.
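The structure above can be sketched with a small synthetic example. The symbol name and the values below are illustrative, not taken from the repository; only the dict-of-DataFrames layout and the [open, high, low, close, vol, amt] columns follow the documented output format.

```python
import pickle
import pandas as pd

# Build a dict[symbol -> DataFrame] mirroring the documented split-file layout.
idx = pd.date_range("2020-01-01", periods=3, freq="D")
sample = {
    "SH600000": pd.DataFrame(  # hypothetical symbol
        {
            "open": [10.0, 10.1, 10.2],
            "high": [10.2, 10.3, 10.4],
            "low": [9.9, 10.0, 10.1],
            "close": [10.1, 10.2, 10.3],
            "vol": [1.0e6, 1.1e6, 0.9e6],
            "amt": [1.01e7, 1.12e7, 0.93e7],
        },
        index=idx,
    )
}

# Round-trip through pickle the same way the preprocessor serializes a split.
with open("train_data.pkl", "wb") as f:
    pickle.dump(sample, f)
with open("train_data.pkl", "rb") as f:
    data = pickle.load(f)

print(list(data.keys()))
print(list(data["SH600000"].columns))
```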

Constructor

def __init__(self):
    self.config = Config()
    self.data_fields = ['open', 'close', 'high', 'low', 'volume', 'vwap']
    self.data = {}  # dict to store processed data for each symbol

The constructor creates a Config instance internally and initializes the raw Qlib field list and an empty data dictionary.

Methods

initialize_qlib()

def initialize_qlib(self) -> None

Initializes the Qlib environment by calling qlib.init() with the configured data path and REG_CN (China A-share region).
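A minimal sketch of the initialization call, assuming the default data path from the Input section; the exact keyword usage inside the class may differ:

```python
import qlib
from qlib.config import REG_CN

# Point Qlib at the downloaded CN data bundle (documented default path)
# and select the China A-share region.
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data", region=REG_CN)
```

This requires the CN data bundle to already be downloaded to the configured path.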

load_qlib_data()

def load_qlib_data(self) -> None

Loads raw data from Qlib and processes it symbol by symbol:

  • Uses QlibDataLoader to load all fields for the instrument universe
  • Adjusts the time range with buffers: start minus lookback_window, end plus predict_window
  • For each symbol:
    • Pivots the table to have features as columns and datetime as index
    • Renames Qlib fields (removes $ prefix)
  • Computes vol (an alias for volume) and amt (the mean of the OHLC prices multiplied by volume)
    • Selects only Config.feature_list columns
    • Drops rows with NaN values
    • Filters out symbols with fewer than lookback_window + predict_window + 1 rows
  • Stores valid symbols in self.data
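The per-symbol transform steps above can be sketched as follows. A synthetic frame stands in for the QlibDataLoader output, and feature_list, lookback_window, and predict_window are illustrative stand-ins for the Config values:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one symbol's pivoted frame: raw Qlib fields
# as columns ('$'-prefixed), datetime as index.
idx = pd.date_range("2020-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {
        "$open": [10.0, 10.1, np.nan, 10.3, 10.4],
        "$high": [10.2, 10.3, 10.4, 10.5, 10.6],
        "$low": [9.9, 10.0, 10.1, 10.2, 10.3],
        "$close": [10.1, 10.2, 10.3, 10.4, 10.5],
        "$volume": [1.0e6, 1.1e6, 1.2e6, 0.9e6, 1.0e6],
    },
    index=idx,
)

# Strip the '$' prefix from the raw Qlib field names.
df = df.rename(columns=lambda c: c.lstrip("$"))

# Derive vol (alias for volume) and amt (mean OHLC price times volume).
df["vol"] = df["volume"]
df["amt"] = (df["open"] + df["high"] + df["low"] + df["close"]) / 4 * df["volume"]

# Keep only the feature columns and drop incomplete rows.
feature_list = ["open", "high", "low", "close", "vol", "amt"]  # assumed Config.feature_list
df = df[feature_list].dropna()

# Discard symbols too short to yield a single training window.
lookback_window, predict_window = 2, 1  # illustrative values
keep = len(df) >= lookback_window + predict_window + 1
```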

prepare_dataset()

def prepare_dataset(self) -> None

Splits the loaded data into train/val/test sets using time-based boolean masks:

  • For each symbol, applies date range masks from Config.train_time_range, Config.val_time_range, and Config.test_time_range
  • Creates the output directory (Config.dataset_path) if it does not exist
  • Serializes each split as a pickle file
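The splitting logic can be sketched as below. The date ranges and output directory name are placeholders standing in for Config.train_time_range, Config.val_time_range, Config.test_time_range, and Config.dataset_path:

```python
import pickle
from pathlib import Path

import pandas as pd

# One symbol of daily data for a full (leap) year.
idx = pd.date_range("2020-01-01", "2020-12-31", freq="D")
data = {"SH600000": pd.DataFrame({"close": range(len(idx))}, index=idx)}

# Illustrative stand-ins for the Config time ranges.
ranges = {
    "train": ("2020-01-01", "2020-08-31"),
    "val": ("2020-09-01", "2020-10-31"),
    "test": ("2020-11-01", "2020-12-31"),
}

dataset_path = Path("datasets")  # stands in for Config.dataset_path
dataset_path.mkdir(parents=True, exist_ok=True)

for split, (start, end) in ranges.items():
    split_data = {}
    for symbol, df in data.items():
        # Time-based boolean mask selecting this split's rows.
        mask = (df.index >= start) & (df.index <= end)
        split_data[symbol] = df.loc[mask]
    with open(dataset_path / f"{split}_data.pkl", "wb") as f:
        pickle.dump(split_data, f)
```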

Example Usage

from qlib_data_preprocess import QlibDataPreprocessor

preprocessor = QlibDataPreprocessor()
preprocessor.initialize_qlib()
preprocessor.load_qlib_data()
preprocessor.prepare_dataset()

This can also be run directly as a script:

cd finetune/
python qlib_data_preprocess.py

Feature Derivation Details

The amt (amount) feature is computed as:

amt = (open + high + low + close) / 4 * volume

This approximates the transaction amount using the average of OHLC prices multiplied by volume.
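A worked example of the approximation for a single bar, using made-up prices:

```python
# Hypothetical single-bar values.
open_, high, low, close, volume = 10.0, 10.4, 9.8, 10.2, 1_000_000

# Average of the four OHLC prices: (10.0 + 10.4 + 9.8 + 10.2) / 4 = 10.1
avg_price = (open_ + high + low + close) / 4

# Approximate transaction amount: 10.1 * 1,000,000 = 10,100,000
amt = avg_price * volume
```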

Source Reference

File: finetune/qlib_data_preprocess.py, lines 14-121.
