Principle: Shiyu-coder Kronos Qlib Training Dataset
| Field | Value |
|---|---|
| principle_name | Qlib_Training_Dataset |
| repository | https://github.com/shiyu-coder/Kronos |
| domains | Data_Loading, Time_Series, PyTorch |
| implemented_by | Implementation:Shiyu_coder_Kronos_QlibDataset_Usage |
| last_updated | 2026-02-09 14:00 GMT |
Summary
A PyTorch Dataset that pre-computes all valid sliding windows from pickled financial data and randomly samples from them during training epochs.
Concept
The Qlib Training Dataset principle defines how financial time series data is presented to the model during training. The core challenge is converting variable-length per-symbol time series into fixed-size windows suitable for batched training, while ensuring:
- All valid windows across all symbols are accessible
- Sampling is random but reproducible
- Data normalization prevents information leakage
- Training and validation use consistent but distinct sampling strategies
Theory
The dataset design follows a sliding window approach for time series with several important properties:
Pre-computed Index Enumeration
At initialization, the dataset enumerates all valid (symbol, start_index) pairs across the entire dataset. A valid pair is one where the window of size lookback_window + predict_window + 1 fits entirely within the symbol's available data. This pre-computation:
- Avoids runtime boundary checking
- Enables uniform sampling across all symbols regardless of their individual series lengths
- Allows reporting the total number of available samples
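The enumeration step can be sketched as follows. This is an illustrative reconstruction, not the Kronos source; the function name `enumerate_windows` and the `series_lengths` mapping are assumptions for the sketch, while `lookback_window` and `predict_window` follow the text above.

```python
def enumerate_windows(series_lengths, lookback_window, predict_window):
    """Return every valid (symbol, start_index) pair.

    A pair is valid when a window of size
    lookback_window + predict_window + 1 fits entirely within the
    symbol's available data. (Hypothetical sketch of the idea
    described in the text, not the actual Kronos implementation.)
    """
    window = lookback_window + predict_window + 1
    pairs = []
    for symbol, length in series_lengths.items():
        # Valid start indices run from 0 to length - window inclusive;
        # symbols shorter than one window contribute no pairs.
        for start in range(max(0, length - window + 1)):
            pairs.append((symbol, start))
    return pairs
```

Because the pool is flat, drawing uniformly from it samples each valid window with equal probability, independent of how long each symbol's series is.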
Random Sampling with Epoch Seeding
Rather than iterating through samples sequentially, each call to __getitem__ draws a random index from the pre-computed pool using a dedicated Python random.Random instance. The seed of this RNG can be reset per epoch via set_epoch_seed(), which:
- Ensures different random orderings across epochs
- Guarantees reproducibility when combined with a fixed seed
- Is essential for correctness in Distributed Data Parallel (DDP) training, where each rank must sample different data but in a reproducible manner
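A minimal sketch of this sampling scheme, assuming a seed formula that folds the epoch and DDP rank together (the exact formula used by Kronos is not shown in this entry, so the one below is an assumption):

```python
import random


class EpochSeededSampler:
    """Illustrative sketch of per-epoch seeded sampling with a
    dedicated random.Random instance (not the Kronos code)."""

    def __init__(self, num_windows, base_seed=0):
        self.num_windows = num_windows
        self.base_seed = base_seed
        self.rng = random.Random(base_seed)

    def set_epoch_seed(self, epoch, rank=0):
        # Fold epoch and rank into the seed so each DDP rank draws a
        # distinct but reproducible stream. (Assumed scheme.)
        self.rng.seed(self.base_seed + 1000 * epoch + rank)

    def draw(self):
        # __getitem__ would use this index into the pre-computed
        # (symbol, start_index) pool.
        return self.rng.randrange(self.num_windows)
```

Using a dedicated `random.Random` instance rather than the global `random` module keeps the sampling stream isolated from any other randomness in the training loop.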
Instance-Level Normalization
Each window is normalized independently using its own mean and standard deviation (computed per feature across time steps). This is critical for financial data because:
- Prevents data leakage: No information from outside the window is used
- Handles non-stationarity: Financial time series have time-varying statistical properties
- Clip bounds: After normalization, values are clipped to [-clip, clip] (default [-5.0, 5.0]) to suppress extreme outliers
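The normalization step amounts to the following sketch. The `eps` guard against zero-variance features is an assumption added for numerical safety; the original code may handle that case differently.

```python
import numpy as np


def normalize_window(window, clip=5.0, eps=1e-8):
    """Instance-level normalization sketch: per-feature mean and std
    computed across the window's time steps only, then clipping.

    `window` is a (time_steps, n_features) array. No statistics from
    outside the window are used, which avoids information leakage.
    """
    mean = window.mean(axis=0, keepdims=True)  # shape (1, n_features)
    std = window.std(axis=0, keepdims=True)    # shape (1, n_features)
    normed = (window - mean) / (std + eps)     # eps guards flat features
    return np.clip(normed, -clip, clip)
```

Because each window carries its own statistics, a model sees comparably scaled inputs even when the underlying price level drifts over months or years.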
Configurable Epoch Size
The effective dataset size per epoch is the minimum of a configured iteration count (n_train_iter or n_val_iter) and the total number of available samples. This allows controlling epoch length for large datasets where a full pass would be impractical.
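The sizing rule reduces to a one-line `__len__`, sketched below with assumed attribute names (`n_train_iter`, `total_windows`) matching the text:

```python
class WindowPool:
    """Toy stand-in for the dataset's epoch-sizing logic (names
    assumed for illustration; not the Kronos class)."""

    def __init__(self, total_windows, n_train_iter):
        self.total_windows = total_windows
        self.n_train_iter = n_train_iter

    def __len__(self):
        # One epoch serves at most n_train_iter samples, capped by the
        # number of pre-computed valid windows.
        return min(self.n_train_iter, self.total_windows)
```

A DataLoader iterating over this dataset therefore performs `min(n_train_iter, total_windows)` draws per epoch, so epoch wall-clock time stays bounded on very large datasets.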
Domains
- Data_Loading: PyTorch Dataset and DataLoader integration
- Time_Series: Sliding window construction and normalization
- PyTorch: Custom Dataset implementation with DDP compatibility