Principle: Shiyu-coder Kronos Qlib Training Dataset
| Field | Value |
|---|---|
| principle_name | Qlib_Training_Dataset |
| repository | https://github.com/shiyu-coder/Kronos |
| domains | Data_Loading, Time_Series, PyTorch |
| implemented_by | Implementation:Shiyu_coder_Kronos_QlibDataset_Usage |
| last_updated | 2026-02-09 14:00 GMT |
Summary
A PyTorch Dataset that pre-computes all valid sliding windows from pickled financial data and randomly samples from them during training epochs.
Concept
The Qlib Training Dataset principle defines how financial time series data is presented to the model during training. The core challenge is converting variable-length per-symbol time series into fixed-size windows suitable for batched training, while ensuring:
- All valid windows across all symbols are accessible
- Sampling is random but reproducible
- Data normalization prevents information leakage
- Training and validation use consistent but distinct sampling strategies
Theory
The dataset design follows a sliding window approach for time series with several important properties:
Pre-computed Index Enumeration
At initialization, the dataset enumerates all valid (symbol, start_index) pairs across the entire dataset. A valid pair is one where the window of size lookback_window + predict_window + 1 fits entirely within the symbol's available data. This pre-computation:
- Avoids runtime boundary checking
- Enables uniform sampling across all symbols regardless of their individual series lengths
- Allows reporting the total number of available samples
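The enumeration step can be sketched as follows. This is an illustrative reconstruction, not the Kronos source; the function name `enumerate_windows` and the `series_lengths` mapping are assumptions for the sketch, while `lookback_window` and `predict_window` follow the text above.

```python
def enumerate_windows(series_lengths, lookback_window, predict_window):
    """Return every valid (symbol, start_index) pair.

    A pair is valid when a window of size
    lookback_window + predict_window + 1 fits entirely within the
    symbol's available data. (Hypothetical sketch of the idea
    described in the text, not the actual Kronos implementation.)
    """
    window = lookback_window + predict_window + 1
    pairs = []
    for symbol, length in series_lengths.items():
        # Valid start indices run from 0 to length - window inclusive;
        # symbols shorter than one window contribute no pairs.
        for start in range(max(0, length - window + 1)):
            pairs.append((symbol, start))
    return pairs
```

Because the pool is flat, drawing uniformly from it samples each valid window with equal probability, independent of how long each symbol's series is.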
Random Sampling with Epoch Seeding
Rather than iterating through samples sequentially, each call to __getitem__ draws a random index from the pre-computed pool using a dedicated Python random.Random instance. The seed of this RNG can be reset per epoch via set_epoch_seed(), which:
- Ensures different random orderings across epochs
- Guarantees reproducibility when combined with a fixed seed
- Is essential for correctness in Distributed Data Parallel (DDP) training, where each rank must sample different data but in a reproducible manner
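A minimal sketch of this sampling scheme, assuming a seed formula that folds the epoch and DDP rank together (the exact formula used by Kronos is not shown in this entry, so the one below is an assumption):

```python
import random


class EpochSeededSampler:
    """Illustrative sketch of per-epoch seeded sampling with a
    dedicated random.Random instance (not the Kronos code)."""

    def __init__(self, num_windows, base_seed=0):
        self.num_windows = num_windows
        self.base_seed = base_seed
        self.rng = random.Random(base_seed)

    def set_epoch_seed(self, epoch, rank=0):
        # Fold epoch and rank into the seed so each DDP rank draws a
        # distinct but reproducible stream. (Assumed scheme.)
        self.rng.seed(self.base_seed + 1000 * epoch + rank)

    def draw(self):
        # __getitem__ would use this index into the pre-computed
        # (symbol, start_index) pool.
        return self.rng.randrange(self.num_windows)
```

Using a dedicated `random.Random` instance rather than the global `random` module keeps the sampling stream isolated from any other randomness in the training loop.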
Instance-Level Normalization
Each window is normalized independently using its own mean and standard deviation (computed per feature across time steps). This is critical for financial data because:
- Prevents data leakage: No information from outside the window is used
- Handles non-stationarity: Financial time series have time-varying statistical properties
- Clip bounds: After normalization, values are clipped to [-clip, clip] (default [-5.0, 5.0]) to suppress extreme outliers
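The normalization step amounts to the following sketch. The `eps` guard against zero-variance features is an assumption added for numerical safety; the original code may handle that case differently.

```python
import numpy as np


def normalize_window(window, clip=5.0, eps=1e-8):
    """Instance-level normalization sketch: per-feature mean and std
    computed across the window's time steps only, then clipping.

    `window` is a (time_steps, n_features) array. No statistics from
    outside the window are used, which avoids information leakage.
    """
    mean = window.mean(axis=0, keepdims=True)  # shape (1, n_features)
    std = window.std(axis=0, keepdims=True)    # shape (1, n_features)
    normed = (window - mean) / (std + eps)     # eps guards flat features
    return np.clip(normed, -clip, clip)
```

Because each window carries its own statistics, a model sees comparably scaled inputs even when the underlying price level drifts over months or years.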
Configurable Epoch Size
The effective dataset size per epoch is the minimum of a configured iteration count (n_train_iter or n_val_iter) and the total number of available samples. This allows controlling epoch length for large datasets where a full pass would be impractical.
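The sizing rule reduces to a one-line `__len__`, sketched below with assumed attribute names (`n_train_iter`, `total_windows`) matching the text:

```python
class WindowPool:
    """Toy stand-in for the dataset's epoch-sizing logic (names
    assumed for illustration; not the Kronos class)."""

    def __init__(self, total_windows, n_train_iter):
        self.total_windows = total_windows
        self.n_train_iter = n_train_iter

    def __len__(self):
        # One epoch serves at most n_train_iter samples, capped by the
        # number of pre-computed valid windows.
        return min(self.n_train_iter, self.total_windows)
```

A DataLoader iterating over this dataset therefore performs `min(n_train_iter, total_windows)` draws per epoch, so epoch wall-clock time stays bounded on very large datasets.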
Domains
- Data_Loading: PyTorch Dataset and DataLoader integration
- Time_Series: Sliding window construction and normalization
- PyTorch: Custom Dataset implementation with DDP compatibility