
Principle:Shiyu coder Kronos Qlib Training Dataset

From Leeroopedia


Field Value
principle_name Qlib_Training_Dataset
repository https://github.com/shiyu-coder/Kronos
domains Data_Loading, Time_Series, PyTorch
implemented_by Implementation:Shiyu_coder_Kronos_QlibDataset_Usage
last_updated 2026-02-09 14:00 GMT

Summary

A PyTorch Dataset that pre-computes all valid sliding windows from pickled financial data and randomly samples from them during training epochs.

Concept

The Qlib Training Dataset principle defines how financial time series data is presented to the model during training. The core challenge is converting variable-length per-symbol time series into fixed-size windows suitable for batched training, while ensuring:

  • All valid windows across all symbols are accessible
  • Sampling is random but reproducible
  • Data normalization prevents information leakage
  • Training and validation use consistent but distinct sampling strategies

Theory

The dataset design follows a sliding window approach for time series with several important properties:

Pre-computed Index Enumeration

At initialization, the dataset enumerates all valid (symbol, start_index) pairs across the entire dataset. A valid pair is one where the window of size lookback_window + predict_window + 1 fits entirely within the symbol's available data. This pre-computation:

  • Avoids runtime boundary checking
  • Enables uniform sampling across all symbols regardless of their individual series lengths
  • Allows reporting the total number of available samples
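As a minimal sketch of this enumeration (names such as `data`, `lookback_window`, and `predict_window` are assumptions, not necessarily the repository's actual identifiers):

```python
def build_index(data, lookback_window, predict_window):
    """Enumerate every valid (symbol, start_index) pair up front.

    `data` maps each symbol to its time-ordered series. A pair is valid
    when a window of lookback_window + predict_window + 1 steps fits
    entirely inside that symbol's series.
    """
    window = lookback_window + predict_window + 1
    pairs = []
    for symbol, series in data.items():
        # Symbols shorter than one window contribute no pairs at all.
        for start in range(len(series) - window + 1):
            pairs.append((symbol, start))
    return pairs
```

Because the pool is flat across symbols, uniform sampling from it automatically weights longer series more heavily, in proportion to the number of windows they contain.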

Random Sampling with Epoch Seeding

Rather than iterating through samples sequentially, each call to __getitem__ draws a random index from the pre-computed pool using a dedicated Python random.Random instance. The seed of this RNG can be reset per epoch via set_epoch_seed(), which:

  • Ensures different random orderings across epochs
  • Guarantees reproducibility when combined with a fixed seed
  • Is essential for correctness in Distributed Data Parallel (DDP) training, where each rank must sample different data but in a reproducible manner
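A hedged sketch of the sampling logic (the class and attribute names here are assumptions; the real QlibDataset differs in detail):

```python
import random

class WindowSampler:
    """Draws windows randomly from a pre-computed pool, reseedable per epoch."""

    def __init__(self, pairs, base_seed=42):
        self.pairs = pairs
        self.base_seed = base_seed
        self.rng = random.Random(base_seed)  # dedicated RNG, independent of global state

    def set_epoch_seed(self, epoch, rank=0):
        # Combine base seed, epoch, and DDP rank so each rank draws
        # different samples each epoch, yet every run is reproducible.
        self.rng.seed(self.base_seed + 10_000 * epoch + rank)

    def __getitem__(self, _idx):
        # Ignore the incoming index; draw randomly from the pool instead.
        return self.rng.choice(self.pairs)
```

Using a dedicated `random.Random` instance (rather than the module-level functions) keeps the sampler's stream isolated from any other code that touches Python's global RNG.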

Instance-Level Normalization

Each window is normalized independently using its own per-feature mean and standard deviation, computed across the window's time steps. This is critical for financial data because it:

  • Prevents data leakage: no information from outside the window is used
  • Handles non-stationarity: financial time series have time-varying statistical properties

After normalization, values are clipped to [-clip, clip] (default [-5.0, 5.0]) to suppress extreme outliers.
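A sketch of per-window normalization with NumPy (the function name and the `eps` guard are assumptions):

```python
import numpy as np

def normalize_window(window, clip=5.0, eps=1e-8):
    """Normalize one (time, features) window using only its own statistics.

    Per-feature mean/std are computed across the window's time steps,
    so nothing outside the window leaks in; values are then clipped
    to [-clip, clip] to suppress extreme outliers.
    """
    mean = window.mean(axis=0)
    std = window.std(axis=0)
    z = (window - mean) / (std + eps)  # eps avoids division by zero on flat features
    return np.clip(z, -clip, clip)
```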

Configurable Epoch Size

The effective dataset size per epoch is the minimum of a configured iteration count (n_train_iter or n_val_iter) and the total number of available samples. This allows controlling epoch length for large datasets where a full pass would be impractical.
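In code this is just a bounded length calculation, of the kind a Dataset's `__len__` would return (a sketch; the names are assumptions):

```python
def epoch_length(n_iter, total_samples):
    """Effective samples per epoch: the configured iteration count
    (n_train_iter / n_val_iter), capped by how many windows exist."""
    return min(n_iter, total_samples)
```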

Domains

  • Data_Loading: PyTorch Dataset and DataLoader integration
  • Time_Series: Sliding window construction and normalization
  • PyTorch: Custom Dataset implementation with DDP compatibility
