

Principle:Shiyu coder Kronos CSV Dataset Handling

From Leeroopedia


Principle Name: CSV_Dataset_Handling
Repository: Shiyu_coder_Kronos
Repository URL: https://github.com/shiyu-coder/Kronos
Domains: Data_Loading, Time_Series, PyTorch
Implemented By: Implementation:Shiyu_coder_Kronos_CustomKlineDataset_Usage
Last Updated: 2026-02-09 14:00 GMT

Overview

This principle describes how to load and preprocess custom CSV financial data into a PyTorch Dataset, with time-based train/val/test splitting and instance-level (per-window) normalization, for use in Kronos fine-tuning.

Concept

Custom CSV financial data (OHLCV + amount) is loaded, parsed, sorted chronologically, augmented with temporal features, split by time-ordered ratios, and served through a sliding window sampler with deterministic epoch-based shuffling and per-window instance normalization.

Theory

The data pipeline follows a strict sequence of transformations:

1. Read CSV and Parse Timestamps

The CSV file is read with pandas. The timestamps column is parsed to datetime objects using pd.to_datetime(), and the data is sorted chronologically to ensure temporal ordering.
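A minimal sketch of this step, using an in-memory CSV in place of the real file; the column names ("timestamps" plus OHLCV and "amount") follow the description here, but the data values and file contents are illustrative:

```python
import io

import pandas as pd

# An in-memory CSV stands in for the real file; note the rows are
# deliberately out of order to show the chronological sort.
csv_text = """timestamps,open,high,low,close,volume,amount
2024-01-01 10:01:00,100,101,99,100.5,10,1005
2024-01-01 10:00:00,99,100,98,99.5,12,1194
"""
df = pd.read_csv(io.StringIO(csv_text))
df["timestamps"] = pd.to_datetime(df["timestamps"])       # parse to datetime
df = df.sort_values("timestamps").reset_index(drop=True)  # chronological order
```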

2. Generate Temporal Features

Five temporal features are extracted from each timestamp:

  • minute -- minute of the hour (0-59)
  • hour -- hour of the day (0-23)
  • weekday -- day of the week (0=Monday, 6=Sunday)
  • day -- day of the month (1-31)
  • month -- month of the year (1-12)

These temporal features are passed as a separate tensor (x_stamp) alongside the price/volume features.
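The five features above map directly onto pandas' `Series.dt` accessor; a small sketch (the single timestamp is illustrative):

```python
import pandas as pd

# Extract the five temporal features from a parsed timestamp column.
ts = pd.to_datetime(pd.Series(["2024-03-08 14:37:00"]))  # a Friday
stamp = pd.DataFrame({
    "minute":  ts.dt.minute,   # 0-59
    "hour":    ts.dt.hour,     # 0-23
    "weekday": ts.dt.weekday,  # 0=Monday, 6=Sunday
    "day":     ts.dt.day,      # 1-31
    "month":   ts.dt.month,    # 1-12
})
```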

3. Time-Based Splitting

Data is split by position in the sorted time series, not by random sampling:

  • Train: first train_ratio fraction of rows (e.g., first 90%)
  • Validation: next val_ratio fraction of rows (e.g., next 10%)
  • Test: remaining test_ratio fraction of rows

This preserves temporal causality -- the model never sees future data during training.
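Positional splitting can be sketched as follows; the 80/10/10 ratios and the toy DataFrame are illustrative, not prescribed values:

```python
import pandas as pd

# df stands in for the chronologically sorted data.
df = pd.DataFrame({"close": range(100)})

train_ratio, val_ratio = 0.8, 0.1  # illustrative; test gets the remainder
train_end = int(len(df) * train_ratio)
val_end = int(len(df) * (train_ratio + val_ratio))

train_df = df.iloc[:train_end]         # earliest rows
val_df   = df.iloc[train_end:val_end]  # next slice in time
test_df  = df.iloc[val_end:]           # latest rows
```

Because the slices are positional on a time-sorted frame, every validation row is strictly later than every training row.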

4. Sliding Window Sampling with Deterministic Shuffling

Each sample is a window of size lookback_window + predict_window + 1. For training data, the starting index for each sample is computed using a deterministic hash function:

start_idx = (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)

This provides:

  • Deterministic behavior given the same epoch and index
  • Pseudo-random shuffling across epochs by varying the epoch seed
  • Reproducibility across runs with identical seeds
  • Different sample orderings each epoch to improve generalization

For validation/test data, sampling is sequential: start_idx = idx % (max_start + 1).
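The two start-index rules above can be written as small helper functions; the prime multipliers are exactly those in the formula, while the function names are illustrative:

```python
def train_start_idx(idx: int, epoch: int, max_start: int) -> int:
    # Deterministic hash: same (idx, epoch) always yields the same start,
    # while varying the epoch reshuffles the window order.
    return (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)

def eval_start_idx(idx: int, max_start: int) -> int:
    # Validation/test sampling is sequential.
    return idx % (max_start + 1)
```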

5. Instance-Level Normalization

Each window is independently normalized:

x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
x = (x - x_mean) / (x_std + 1e-5)
x = np.clip(x, -clip, clip)
  • Per-column (per-feature) mean and standard deviation are computed over the window
  • Z-score normalization with epsilon 1e-5 for numerical stability
  • Clipping to [-clip, clip] (default 5.0) to bound outliers

This instance normalization is critical for financial time series because different time windows can have vastly different price scales and volatility regimes.
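A runnable version of the normalization snippet above, applied to a synthetic window whose raw scale (prices around 1000) is far from zero; `clip=5.0` is the stated default:

```python
import numpy as np

# One window: 64 rows x 6 value features, at an arbitrary price scale.
rng = np.random.default_rng(0)
x = rng.normal(loc=1000.0, scale=50.0, size=(64, 6))

clip = 5.0
x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)  # per-feature stats
x_norm = np.clip((x - x_mean) / (x_std + 1e-5), -clip, clip)
```

Regardless of the window's raw scale, the output is centered near zero with unit-order spread and bounded to [-5, 5].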

Features

The six value features extracted from CSV are:

  • open, high, low, close, volume, amount

The five temporal features are:

  • minute, hour, weekday, day, month
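The pieces above can be combined into one map-style dataset sketch. This is a hypothetical illustration, not Kronos's actual class or API: the class name, constructor arguments, and `set_epoch` hook are assumptions; only the window size, hash formula, and normalization follow the steps described here. The class implements the `__len__`/`__getitem__` protocol expected by `torch.utils.data.Dataset`.

```python
import numpy as np
import pandas as pd

class KlineWindowDataset:
    """Hypothetical sliding-window dataset over sorted CSV k-line data."""

    VALUE_COLS = ["open", "high", "low", "close", "volume", "amount"]

    def __init__(self, df, lookback_window, predict_window, clip=5.0, train=True):
        df = df.sort_values("timestamps").reset_index(drop=True)
        ts = df["timestamps"].dt
        self.values = df[self.VALUE_COLS].to_numpy(dtype=np.float32)
        # Temporal features: minute, hour, weekday, day, month.
        self.stamps = np.stack(
            [ts.minute, ts.hour, ts.weekday, ts.day, ts.month], axis=1
        ).astype(np.float32)
        self.window = lookback_window + predict_window + 1
        self.max_start = len(df) - self.window
        self.clip = clip
        self.train = train
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch  # reshuffles training windows each epoch

    def __len__(self):
        return self.max_start + 1

    def __getitem__(self, idx):
        if self.train:
            start = (idx * 9973 + (self.epoch + 1) * 104729) % (self.max_start + 1)
        else:
            start = idx % (self.max_start + 1)
        x = self.values[start:start + self.window]
        # Instance-level normalization per window.
        x_mean, x_std = x.mean(axis=0), x.std(axis=0)
        x = np.clip((x - x_mean) / (x_std + 1e-5), -self.clip, self.clip)
        x_stamp = self.stamps[start:start + self.window]
        return x, x_stamp
```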
