Principle: Shiyu-coder Kronos Qlib Data Preprocessing
| Field | Value |
|---|---|
| principle_name | Qlib_Data_Preprocessing |
| repository | https://github.com/shiyu-coder/Kronos |
| domains | Data_Engineering, Financial_Data, ETL |
| implemented_by | Implementation:Shiyu_coder_Kronos_QlibDataPreprocessor_Usage |
| last_updated | 2026-02-09 14:00 GMT |
Summary
Extracting, transforming, and splitting financial market data from Microsoft Qlib into train/val/test pickle files for model training.
Concept
The Qlib Data Preprocessing principle defines a structured ETL (Extract, Transform, Load) pipeline for converting raw financial time series from the Qlib data platform into ready-to-use training datasets. The pipeline handles the complexities of:
- Initializing and connecting to the Qlib data provider
- Extracting per-symbol OHLCV data across a configurable time range
- Computing derived features from raw market data
- Filtering out symbols with insufficient data
- Splitting data into time-based train/validation/test partitions
- Serializing the results to disk as pickle files
Theory
The pipeline follows a classic ETL pattern adapted for financial time series:
Extract
Raw data is loaded from Microsoft Qlib's data provider using QlibDataLoader. The Qlib framework abstracts away the underlying data storage, providing a unified API for accessing Chinese A-share market data. The extraction adjusts the time range to include buffer periods:
- Start buffer: Subtracts `lookback_window` from the dataset start time to ensure the first training sample has sufficient history.
- End buffer: Adds `predict_window` to the dataset end time to ensure the last sample has sufficient future data.
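The buffer adjustment can be sketched as follows. The variable names and calendar-day offsets here are illustrative assumptions; Kronos' actual pipeline would step through Qlib's trading calendar rather than calendar days.

```python
import pandas as pd

# Hypothetical configuration values (not Kronos' actual config names).
dataset_begin = pd.Timestamp("2011-01-01")
dataset_end = pd.Timestamp("2021-12-31")
lookback_window = 90   # history required by the first sample
predict_window = 10    # future horizon required by the last sample

# Widen the requested range so boundary samples have enough context.
# Calendar-day offsets are used here for simplicity; a production pipeline
# would offset by positions in the exchange trading calendar instead.
load_begin = dataset_begin - pd.Timedelta(days=lookback_window)
load_end = dataset_end + pd.Timedelta(days=predict_window)
```

The widened `[load_begin, load_end]` range is what gets passed to the loader; the model's samples are still drawn only from the original dataset range.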
Transform
For each symbol in the instrument universe, the pipeline:
- Reshapes the multi-level DataFrame into a per-symbol format with features as columns
- Renames raw Qlib field names (e.g., `$volume` to `volume`)
- Computes derived features: `vol` (volume alias) and `amt` (estimated transaction amount: average price times volume)
- Drops rows with missing data
- Filters out symbols whose available data length is less than `lookback_window + predict_window + 1`
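A minimal sketch of the per-symbol transform, assuming the loader returns a `(datetime, instrument)` MultiIndex frame with `$`-prefixed field names as Qlib conventionally does. The toy input data and exact column formulas are illustrative, not Kronos' code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Qlib loader output: (datetime, instrument) MultiIndex,
# raw "$"-prefixed field names.
idx = pd.MultiIndex.from_product(
    [pd.bdate_range("2020-01-01", periods=120), ["SH600000", "SH600001"]],
    names=["datetime", "instrument"],
)
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    {f: rng.uniform(1.0, 2.0, len(idx))
     for f in ["$open", "$high", "$low", "$close", "$volume"]},
    index=idx,
)

lookback_window, predict_window = 90, 10
symbol_data = {}
for symbol, df in raw.groupby(level="instrument"):
    df = df.droplevel("instrument")                   # per-symbol frame, datetime index
    df = df.rename(columns=lambda c: c.lstrip("$"))   # "$volume" -> "volume", etc.
    df["vol"] = df["volume"]                          # volume alias
    # Estimated transaction amount: average price times volume (an assumption
    # about the formula, stated in the text above as "average price * volume").
    df["amt"] = (df["open"] + df["high"] + df["low"] + df["close"]) / 4 * df["vol"]
    df = df.dropna()
    if len(df) < lookback_window + predict_window + 1:
        continue  # too short to yield even one (lookback, predict) sample
    symbol_data[symbol] = df
```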
Load
The transformed data is split into three time-based partitions using boolean masks on the datetime index. Each partition is serialized as a pickle file containing a dict[symbol -> pd.DataFrame].
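The split-and-serialize step might look like the following sketch. The split dates, output directory, and file names are placeholders, not Kronos' actual configuration.

```python
import pickle
from pathlib import Path

import pandas as pd

# symbol_data: dict[symbol -> DataFrame] as produced by the transform step;
# a one-symbol toy frame stands in for it here.
symbol_data = {
    "SH600000": pd.DataFrame(
        {"close": range(10)},
        index=pd.date_range("2020-01-01", periods=10, freq="D"),
    )
}

# Placeholder date ranges for the three partitions.
splits = {
    "train": ("2020-01-01", "2020-01-06"),
    "val": ("2020-01-07", "2020-01-08"),
    "test": ("2020-01-09", "2020-01-10"),
}

out_dir = Path("qlib_processed")
out_dir.mkdir(exist_ok=True)
for name, (start, end) in splits.items():
    part = {}
    for symbol, df in symbol_data.items():
        # Boolean mask on the datetime index selects this partition's rows.
        mask = (df.index >= start) & (df.index <= end)
        part[symbol] = df.loc[mask]
    with open(out_dir / f"{name}_data.pkl", "wb") as f:
        pickle.dump(part, f)  # dict[symbol -> pd.DataFrame] per partition
```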
Key Design Decisions
- Time-based splitting: Train/val/test splits use date ranges rather than random sampling, which is standard in financial ML to avoid look-ahead bias.
- Overlapping time ranges: Validation and test ranges intentionally start before the training/validation ranges end. This overlap accounts for the lookback window needed at the boundary of each split.
- Per-symbol storage: Data is stored as a dictionary keyed by symbol, enabling efficient per-symbol access during dataset construction.
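The overlap between adjacent splits can be made concrete with a small sketch. The dates are illustrative; the point is that the validation range starts `lookback_window` days before the training range ends, so the first validation sample can still look back without needing data outside its partition.

```python
import pandas as pd

lookback_window = 90  # assumed lookback length, matching the text above

train_end = pd.Timestamp("2019-12-31")
# Start validation early enough that its first sample's lookback window
# fits entirely inside the validation partition's loaded data.
val_begin = train_end - pd.Timedelta(days=lookback_window)
```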
Domains
- Data_Engineering: ETL pipeline design for data processing
- Financial_Data: Market data extraction and feature engineering
- ETL: Extract-Transform-Load pattern for ML data preparation