
Principle:Shiyu coder Kronos Qlib Data Preprocessing

From Leeroopedia


Field Value
principle_name Qlib_Data_Preprocessing
repository https://github.com/shiyu-coder/Kronos
domains Data_Engineering, Financial_Data, ETL
implemented_by Implementation:Shiyu_coder_Kronos_QlibDataPreprocessor_Usage
last_updated 2026-02-09 14:00 GMT

Summary

Extract, transform, and split financial market data from Microsoft Qlib into train/validation/test pickle files for model training.

Concept

The Qlib Data Preprocessing principle defines a structured ETL (Extract, Transform, Load) pipeline for converting raw financial time series from the Qlib data platform into ready-to-use training datasets. The pipeline handles the complexities of:

  • Initializing and connecting to the Qlib data provider
  • Extracting per-symbol OHLCV data across a configurable time range
  • Computing derived features from raw market data
  • Filtering out symbols with insufficient data
  • Splitting data into time-based train/validation/test partitions
  • Serializing the results to disk as pickle files

Theory

The pipeline follows a classic ETL pattern adapted for financial time series:

Extract

Raw data is loaded from Microsoft Qlib's data provider using QlibDataLoader. The Qlib framework abstracts away the underlying data storage, providing a unified API for accessing Chinese A-share market data. The extraction adjusts the time range to include buffer periods:

  • Start buffer: Subtracts lookback_window from the dataset start time to ensure the first training sample has sufficient history.
  • End buffer: Adds predict_window to the dataset end time to ensure the last sample has sufficient future data.
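The buffering step above can be sketched as follows. This is a minimal sketch, not the repository's code: the function name `buffered_range` is hypothetical, and it assumes the two windows are counted in business days, whereas the real pipeline may consult Qlib's trading calendar instead.

```python
import pandas as pd

def buffered_range(start, end, lookback_window, predict_window):
    """Widen [start, end] so samples at both boundaries have full context.

    Assumes lookback_window / predict_window are counted in business days;
    the actual pipeline may use an exchange trading calendar.
    """
    # Start buffer: pull the start back so the first sample has history.
    start = pd.Timestamp(start) - pd.tseries.offsets.BDay(lookback_window)
    # End buffer: push the end forward so the last sample has future data.
    end = pd.Timestamp(end) + pd.tseries.offsets.BDay(predict_window)
    return start, end
```

The widened range is what gets passed to the Qlib loader; the nominal dataset boundaries are still used later when the splits are cut.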

Transform

For each symbol in the instrument universe, the pipeline:

  • Reshapes the multi-level DataFrame into a per-symbol format with features as columns
  • Renames raw Qlib field names (e.g., $volume to volume)
  • Computes derived features: vol (volume alias) and amt (estimated transaction amount from average price times volume)
  • Drops rows with missing data
  • Filters symbols where the available data length is less than lookback_window + predict_window + 1
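The per-symbol transform can be sketched in pandas. This is an illustrative sketch under stated assumptions, not the repository's implementation: the raw column names (`$open`, `$volume`, ...), the averaging of OHLC prices for the `amt` estimate, and the function name `transform_symbol` are assumptions.

```python
import pandas as pd

def transform_symbol(df_sym, lookback_window, predict_window):
    """Transform one symbol's raw Qlib frame; return None if too short.

    Assumes df_sym has raw Qlib columns like $open/$high/$low/$close/$volume.
    """
    # Rename raw Qlib field names, e.g. "$volume" -> "volume".
    df = df_sym.rename(columns=lambda c: c.lstrip("$"))
    # Derived features: vol is a volume alias, amt estimates transaction
    # amount as average price times volume (averaging rule assumed here).
    df["vol"] = df["volume"]
    avg_price = df[["open", "high", "low", "close"]].mean(axis=1)
    df["amt"] = avg_price * df["volume"]
    # Drop rows with missing data.
    df = df.dropna()
    # Filter out symbols with insufficient history.
    if len(df) < lookback_window + predict_window + 1:
        return None
    return df
```

Returning `None` for short symbols lets the caller simply skip them when assembling the symbol dictionary.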

Load

The transformed data is split into three time-based partitions using boolean masks on the datetime index. Each partition is serialized as a pickle file containing a dict[symbol -> pd.DataFrame].
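The load step can be sketched as boolean-mask splitting followed by pickling. The function names, split boundaries, and file layout below are illustrative assumptions, not the repository's exact code; only the dict-of-DataFrames pickle format is taken from the description above.

```python
import pickle
import pandas as pd

def split_partitions(data, splits):
    """Cut dict[symbol -> DataFrame] into time-based partitions.

    splits maps a partition name to an inclusive (start, end) date pair.
    """
    out = {}
    for name, (start, end) in splits.items():
        part = {}
        for sym, df in data.items():
            # Boolean mask on the datetime index selects this partition.
            mask = (df.index >= start) & (df.index <= end)
            if mask.any():
                part[sym] = df.loc[mask]
        out[name] = part
    return out

def dump_partitions(partitions, out_dir):
    # Each partition is serialized as a pickle of dict[symbol -> DataFrame].
    for name, part in partitions.items():
        with open(f"{out_dir}/{name}.pkl", "wb") as f:
            pickle.dump(part, f)
```

A downstream dataset class can then load one pickle and index directly by symbol.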

Key Design Decisions

  • Time-based splitting: Train/val/test splits use date ranges rather than random sampling, which is standard in financial ML to avoid look-ahead bias.
  • Overlapping time ranges: Validation and test ranges intentionally start before the training/validation ranges end. This overlap accounts for the lookback window needed at the boundary of each split.
  • Per-symbol storage: Data is stored as a dictionary keyed by symbol, enabling efficient per-symbol access during dataset construction.
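The overlapping-range decision can be made concrete with a short sketch. The dates and window size below are hypothetical, chosen only to illustrate the convention described above.

```python
import pandas as pd

# Hypothetical boundaries: validation nominally starts where training ends.
lookback_window = 90  # assumed to be counted in business days
nominal_val_start = pd.Timestamp("2021-01-01")

# The loaded validation range is pulled back by the lookback window, so it
# intentionally overlaps the tail of the training range. This guarantees the
# first validation sample sees a full history without leaking future data.
loaded_val_start = nominal_val_start - pd.tseries.offsets.BDay(lookback_window)
```

Note that only the input history overlaps; prediction targets still come exclusively from each partition's own date range, so no look-ahead bias is introduced.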

Domains

  • Data_Engineering: ETL pipeline design for data processing
  • Financial_Data: Market data extraction and feature engineering
  • ETL: Extract-Transform-Load pattern for ML data preparation
