

Principle:Shiyu coder Kronos CSV Dataset Handling

From Leeroopedia


Principle Name: CSV_Dataset_Handling
Repository: Shiyu_coder_Kronos
Repository URL: https://github.com/shiyu-coder/Kronos
Domains: Data_Loading, Time_Series, PyTorch
Implemented By: Implementation:Shiyu_coder_Kronos_CustomKlineDataset_Usage
Last Updated: 2026-02-09 14:00 GMT

Overview

This principle describes how to load and preprocess custom CSV financial data into a PyTorch Dataset, with time-based train/val/test splitting and instance-level (per-window) normalization, for use in Kronos fine-tuning.

Concept

Custom CSV financial data (OHLCV + amount) is loaded, parsed, sorted chronologically, augmented with temporal features, split by time-ordered ratios, and served through a sliding window sampler with deterministic epoch-based shuffling and per-window instance normalization.

Theory

The data pipeline follows a strict sequence of transformations:

1. Read CSV and Parse Timestamps

The CSV file is read with pandas. The timestamps column is parsed to datetime objects using pd.to_datetime(), and the data is sorted chronologically to ensure temporal ordering.
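A minimal sketch of this step, using an in-memory CSV in place of the real file; the column names ("timestamps" plus OHLCV and "amount") follow the description here, but the data values and file contents are illustrative:

```python
import io

import pandas as pd

# An in-memory CSV stands in for the real file; note the rows are
# deliberately out of order to show the chronological sort.
csv_text = """timestamps,open,high,low,close,volume,amount
2024-01-01 10:01:00,100,101,99,100.5,10,1005
2024-01-01 10:00:00,99,100,98,99.5,12,1194
"""
df = pd.read_csv(io.StringIO(csv_text))
df["timestamps"] = pd.to_datetime(df["timestamps"])       # parse to datetime
df = df.sort_values("timestamps").reset_index(drop=True)  # chronological order
```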

2. Generate Temporal Features

Five temporal features are extracted from each timestamp:

  • minute -- minute of the hour (0-59)
  • hour -- hour of the day (0-23)
  • weekday -- day of the week (0=Monday, 6=Sunday)
  • day -- day of the month (1-31)
  • month -- month of the year (1-12)

These temporal features are passed as a separate tensor (x_stamp) alongside the price/volume features.
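The five features above map directly onto pandas' `Series.dt` accessor; a small sketch (the single timestamp is illustrative):

```python
import pandas as pd

# Extract the five temporal features from a parsed timestamp column.
ts = pd.to_datetime(pd.Series(["2024-03-08 14:37:00"]))  # a Friday
stamp = pd.DataFrame({
    "minute":  ts.dt.minute,   # 0-59
    "hour":    ts.dt.hour,     # 0-23
    "weekday": ts.dt.weekday,  # 0=Monday, 6=Sunday
    "day":     ts.dt.day,      # 1-31
    "month":   ts.dt.month,    # 1-12
})
```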

3. Time-Based Splitting

Data is split by position in the sorted time series, not by random sampling:

  • Train: first train_ratio fraction of rows (e.g., first 90%)
  • Validation: next val_ratio fraction of rows (e.g., next 10%)
  • Test: remaining test_ratio fraction of rows

This preserves temporal causality -- the model never sees future data during training.
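Positional splitting can be sketched as follows; the 80/10/10 ratios and the toy DataFrame are illustrative, not prescribed values:

```python
import pandas as pd

# df stands in for the chronologically sorted data.
df = pd.DataFrame({"close": range(100)})

train_ratio, val_ratio = 0.8, 0.1  # illustrative; test gets the remainder
train_end = int(len(df) * train_ratio)
val_end = int(len(df) * (train_ratio + val_ratio))

train_df = df.iloc[:train_end]         # earliest rows
val_df   = df.iloc[train_end:val_end]  # next slice in time
test_df  = df.iloc[val_end:]           # latest rows
```

Because the slices are positional on a time-sorted frame, every validation row is strictly later than every training row.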

4. Sliding Window Sampling with Deterministic Shuffling

Each sample is a window of size lookback_window + predict_window + 1. For training data, the starting index for each sample is computed using a deterministic hash function:

start_idx = (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)

This provides:

  • Deterministic behavior given the same epoch and index
  • Pseudo-random shuffling across epochs by varying the epoch seed
  • Reproducibility across runs with identical seeds
  • Different sample orderings each epoch to improve generalization

For validation/test data, sampling is sequential: start_idx = idx % (max_start + 1).
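The two start-index rules above can be written as small helper functions; the prime multipliers are exactly those in the formula, while the function names are illustrative:

```python
def train_start_idx(idx: int, epoch: int, max_start: int) -> int:
    # Deterministic hash: same (idx, epoch) always yields the same start,
    # while varying the epoch reshuffles the window order.
    return (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)

def eval_start_idx(idx: int, max_start: int) -> int:
    # Validation/test sampling is sequential.
    return idx % (max_start + 1)
```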

5. Instance-Level Normalization

Each window is independently normalized:

x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
x = (x - x_mean) / (x_std + 1e-5)
x = np.clip(x, -clip, clip)
  • Per-column (per-feature) mean and standard deviation are computed over the window
  • Z-score normalization with epsilon 1e-5 for numerical stability
  • Clipping to [-clip, clip] (default 5.0) to bound outliers

This instance normalization is critical for financial time series because different time windows can have vastly different price scales and volatility regimes.
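A runnable version of the normalization snippet above, applied to a synthetic window whose raw scale (prices around 1000) is far from zero; `clip=5.0` is the stated default:

```python
import numpy as np

# One window: 64 rows x 6 value features, at an arbitrary price scale.
rng = np.random.default_rng(0)
x = rng.normal(loc=1000.0, scale=50.0, size=(64, 6))

clip = 5.0
x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)  # per-feature stats
x_norm = np.clip((x - x_mean) / (x_std + 1e-5), -clip, clip)
```

Regardless of the window's raw scale, the output is centered near zero with unit-order spread and bounded to [-5, 5].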

Features

The six value features extracted from CSV are:

  • open, high, low, close, volume, amount

The five temporal features are:

  • minute, hour, weekday, day, month
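The pieces above can be combined into one map-style dataset sketch. This is a hypothetical illustration, not Kronos's actual class or API: the class name, constructor arguments, and `set_epoch` hook are assumptions; only the window size, hash formula, and normalization follow the steps described here. The class implements the `__len__`/`__getitem__` protocol expected by `torch.utils.data.Dataset`.

```python
import numpy as np
import pandas as pd

class KlineWindowDataset:
    """Hypothetical sliding-window dataset over sorted CSV k-line data."""

    VALUE_COLS = ["open", "high", "low", "close", "volume", "amount"]

    def __init__(self, df, lookback_window, predict_window, clip=5.0, train=True):
        df = df.sort_values("timestamps").reset_index(drop=True)
        ts = df["timestamps"].dt
        self.values = df[self.VALUE_COLS].to_numpy(dtype=np.float32)
        # Temporal features: minute, hour, weekday, day, month.
        self.stamps = np.stack(
            [ts.minute, ts.hour, ts.weekday, ts.day, ts.month], axis=1
        ).astype(np.float32)
        self.window = lookback_window + predict_window + 1
        self.max_start = len(df) - self.window
        self.clip = clip
        self.train = train
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch  # reshuffles training windows each epoch

    def __len__(self):
        return self.max_start + 1

    def __getitem__(self, idx):
        if self.train:
            start = (idx * 9973 + (self.epoch + 1) * 104729) % (self.max_start + 1)
        else:
            start = idx % (self.max_start + 1)
        x = self.values[start:start + self.window]
        # Instance-level normalization per window.
        x_mean, x_std = x.mean(axis=0), x.std(axis=0)
        x = np.clip((x - x_mean) / (x_std + 1e-5), -self.clip, self.clip)
        x_stamp = self.stamps[start:start + self.window]
        return x, x_stamp
```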
