Principle: Shiyu-coder Kronos Qlib Data Preprocessing
| Field | Value |
|---|---|
| principle_name | Qlib_Data_Preprocessing |
| repository | https://github.com/shiyu-coder/Kronos |
| domains | Data_Engineering, Financial_Data, ETL |
| implemented_by | Implementation:Shiyu_coder_Kronos_QlibDataPreprocessor_Usage |
| last_updated | 2026-02-09 14:00 GMT |
Summary
Extracting, transforming, and splitting financial market data from Microsoft Qlib into train/val/test pickle files for model training.
Concept
The Qlib Data Preprocessing principle defines a structured ETL (Extract, Transform, Load) pipeline for converting raw financial time series from the Qlib data platform into ready-to-use training datasets. The pipeline handles the complexities of:
- Initializing and connecting to the Qlib data provider
- Extracting per-symbol OHLCV data across a configurable time range
- Computing derived features from raw market data
- Filtering out symbols with insufficient data
- Splitting data into time-based train/validation/test partitions
- Serializing the results to disk as pickle files
Theory
The pipeline follows a classic ETL pattern adapted for financial time series:
Extract
Raw data is loaded from Microsoft Qlib's data provider using QlibDataLoader. The Qlib framework abstracts away the underlying data storage, providing a unified API for accessing Chinese A-share market data. The extraction adjusts the time range to include buffer periods:
- Start buffer: Subtracts `lookback_window` from the dataset start time to ensure the first training sample has sufficient history.
- End buffer: Adds `predict_window` to the dataset end time to ensure the last sample has sufficient future data.
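The buffer adjustment can be sketched as follows. The variable names and calendar-day offsets here are illustrative assumptions; Kronos' actual pipeline would step through Qlib's trading calendar rather than calendar days.

```python
import pandas as pd

# Hypothetical configuration values (not Kronos' actual config names).
dataset_begin = pd.Timestamp("2011-01-01")
dataset_end = pd.Timestamp("2021-12-31")
lookback_window = 90   # history required by the first sample
predict_window = 10    # future horizon required by the last sample

# Widen the requested range so boundary samples have enough context.
# Calendar-day offsets are used here for simplicity; a production pipeline
# would offset by positions in the exchange trading calendar instead.
load_begin = dataset_begin - pd.Timedelta(days=lookback_window)
load_end = dataset_end + pd.Timedelta(days=predict_window)
```

The widened `[load_begin, load_end]` range is what gets passed to the loader; the model's samples are still drawn only from the original dataset range.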
Transform
For each symbol in the instrument universe, the pipeline:
- Reshapes the multi-level DataFrame into a per-symbol format with features as columns
- Renames raw Qlib field names (e.g., `$volume` to `volume`)
- Computes derived features: `vol` (volume alias) and `amt` (estimated transaction amount: average price times volume)
- Drops rows with missing data
- Filters out symbols whose available data length is less than `lookback_window + predict_window + 1`
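A minimal sketch of the per-symbol transform, assuming the loader returns a `(datetime, instrument)` MultiIndex frame with `$`-prefixed field names as Qlib conventionally does. The toy input data and exact column formulas are illustrative, not Kronos' code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Qlib loader output: (datetime, instrument) MultiIndex,
# raw "$"-prefixed field names.
idx = pd.MultiIndex.from_product(
    [pd.bdate_range("2020-01-01", periods=120), ["SH600000", "SH600001"]],
    names=["datetime", "instrument"],
)
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    {f: rng.uniform(1.0, 2.0, len(idx))
     for f in ["$open", "$high", "$low", "$close", "$volume"]},
    index=idx,
)

lookback_window, predict_window = 90, 10
symbol_data = {}
for symbol, df in raw.groupby(level="instrument"):
    df = df.droplevel("instrument")                   # per-symbol frame, datetime index
    df = df.rename(columns=lambda c: c.lstrip("$"))   # "$volume" -> "volume", etc.
    df["vol"] = df["volume"]                          # volume alias
    # Estimated transaction amount: average price times volume (an assumption
    # about the formula, stated in the text above as "average price * volume").
    df["amt"] = (df["open"] + df["high"] + df["low"] + df["close"]) / 4 * df["vol"]
    df = df.dropna()
    if len(df) < lookback_window + predict_window + 1:
        continue  # too short to yield even one (lookback, predict) sample
    symbol_data[symbol] = df
```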
Load
The transformed data is split into three time-based partitions using boolean masks on the datetime index. Each partition is serialized as a pickle file containing a dict[symbol -> pd.DataFrame].
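The split-and-serialize step might look like the following sketch. The split dates, output directory, and file names are placeholders, not Kronos' actual configuration.

```python
import pickle
from pathlib import Path

import pandas as pd

# symbol_data: dict[symbol -> DataFrame] as produced by the transform step;
# a one-symbol toy frame stands in for it here.
symbol_data = {
    "SH600000": pd.DataFrame(
        {"close": range(10)},
        index=pd.date_range("2020-01-01", periods=10, freq="D"),
    )
}

# Placeholder date ranges for the three partitions.
splits = {
    "train": ("2020-01-01", "2020-01-06"),
    "val": ("2020-01-07", "2020-01-08"),
    "test": ("2020-01-09", "2020-01-10"),
}

out_dir = Path("qlib_processed")
out_dir.mkdir(exist_ok=True)
for name, (start, end) in splits.items():
    part = {}
    for symbol, df in symbol_data.items():
        # Boolean mask on the datetime index selects this partition's rows.
        mask = (df.index >= start) & (df.index <= end)
        part[symbol] = df.loc[mask]
    with open(out_dir / f"{name}_data.pkl", "wb") as f:
        pickle.dump(part, f)  # dict[symbol -> pd.DataFrame] per partition
```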
Key Design Decisions
- Time-based splitting: Train/val/test splits use date ranges rather than random sampling, which is standard in financial ML to avoid look-ahead bias.
- Overlapping time ranges: Validation and test ranges intentionally start before the training/validation ranges end. This overlap accounts for the lookback window needed at the boundary of each split.
- Per-symbol storage: Data is stored as a dictionary keyed by symbol, enabling efficient per-symbol access during dataset construction.
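The overlap between adjacent splits can be made concrete with a small sketch. The dates are illustrative; the point is that the validation range starts `lookback_window` days before the training range ends, so the first validation sample can still look back without needing data outside its partition.

```python
import pandas as pd

lookback_window = 90  # assumed lookback length, matching the text above

train_end = pd.Timestamp("2019-12-31")
# Start validation early enough that its first sample's lookback window
# fits entirely inside the validation partition's loaded data.
val_begin = train_end - pd.Timedelta(days=lookback_window)
```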
Domains
- Data_Engineering: ETL pipeline design for data processing
- Financial_Data: Market data extraction and feature engineering
- ETL: Extract-Transform-Load pattern for ML data preparation