Principle:Shiyu_coder_Kronos_CSV_Dataset_Handling
| Field | Value |
|---|---|
| Principle Name | CSV_Dataset_Handling |
| Repository | Shiyu_coder_Kronos |
| Repository URL | https://github.com/shiyu-coder/Kronos |
| Domains | Data_Loading, Time_Series, PyTorch |
| Implemented By | Implementation:Shiyu_coder_Kronos_CustomKlineDataset_Usage |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
This principle describes loading and preprocessing custom CSV financial data into a PyTorch Dataset, with time-based train/val/test splitting and instance-level normalization, for use in Kronos fine-tuning.
Concept
Custom CSV financial data (OHLCV + amount) is loaded, parsed, sorted chronologically, augmented with temporal features, split by time-ordered ratios, and served through a sliding window sampler with deterministic epoch-based shuffling and per-window instance normalization.
Theory
The data pipeline follows a strict sequence of transformations:
1. Read CSV and Parse Timestamps
The CSV file is read with pandas. The timestamps column is parsed to datetime objects using pd.to_datetime(), and the data is sorted chronologically to ensure temporal ordering.
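A minimal sketch of this step, assuming a `timestamps` column alongside the OHLCV and amount fields (the inline CSV here is illustrative sample data, not from the repository):

```python
from io import StringIO

import pandas as pd

# Illustrative CSV with deliberately out-of-order rows; real data
# would be loaded from a file path instead of a string.
csv_text = """timestamps,open,high,low,close,volume,amount
2024-01-01 10:01:00,10.0,10.5,9.9,10.2,1000,10200
2024-01-01 10:00:00,9.8,10.1,9.7,10.0,1200,12000
"""

df = pd.read_csv(StringIO(csv_text))
df["timestamps"] = pd.to_datetime(df["timestamps"])       # parse to datetime
df = df.sort_values("timestamps").reset_index(drop=True)  # chronological order
```

Sorting after parsing guarantees temporal ordering even when the source file is not already sorted.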
2. Generate Temporal Features
Five temporal features are extracted from each timestamp:
- minute -- minute of the hour (0-59)
- hour -- hour of the day (0-23)
- weekday -- day of the week (0=Monday, 6=Sunday)
- day -- day of the month (1-31)
- month -- month of the year (1-12)
These temporal features are passed as a separate tensor (x_stamp) alongside the price/volume features.
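The five features above map directly onto pandas' `.dt` accessor; a sketch (the `stamp` frame name is illustrative):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2024-03-04 09:30:00", "2024-03-08 15:45:00"]))

# The five temporal features, one column per feature.
stamp = pd.DataFrame({
    "minute": ts.dt.minute,    # 0-59
    "hour": ts.dt.hour,        # 0-23
    "weekday": ts.dt.weekday,  # 0=Monday .. 6=Sunday
    "day": ts.dt.day,          # 1-31
    "month": ts.dt.month,      # 1-12
})
```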
3. Time-Based Splitting
Data is split by position in the sorted time series, not by random sampling:
- Train: first train_ratio fraction of rows (e.g., first 90%)
- Validation: next val_ratio fraction of rows (e.g., next 10%)
- Test: remaining test_ratio fraction of rows
This preserves temporal causality -- the model never sees future data during training.
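The split boundaries reduce to simple index arithmetic on the sorted rows; a sketch with illustrative ratios:

```python
# Positional split of a chronologically sorted dataset.
n = 1000                          # total number of rows
train_ratio, val_ratio = 0.8, 0.1  # illustrative ratios; rest goes to test

train_end = int(n * train_ratio)
val_end = train_end + int(n * val_ratio)

train_idx = range(0, train_end)     # earliest rows
val_idx = range(train_end, val_end)  # next rows
test_idx = range(val_end, n)         # latest rows, never seen in training
```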
4. Sliding Window Sampling with Deterministic Shuffling
Each sample is a window of size lookback_window + predict_window + 1. For training data, the starting index for each sample is computed using a deterministic hash function:
start_idx = (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)
This provides:
- Deterministic behavior given the same epoch and index
- Pseudo-random shuffling across epochs by varying the epoch seed
- Reproducibility across runs with identical seeds
- Different sample orderings each epoch to improve generalization
For validation/test data, sampling is sequential: start_idx = idx % (max_start + 1).
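Both index schemes can be expressed as small pure functions; a sketch (function names are illustrative, the formulas are the ones given above):

```python
def train_start_idx(idx: int, epoch: int, max_start: int) -> int:
    # Deterministic "shuffle": two large primes mix the sample index
    # and the epoch, so the ordering changes every epoch yet is fully
    # reproducible for a given (idx, epoch) pair.
    return (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)

def eval_start_idx(idx: int, max_start: int) -> int:
    # Validation/test sampling is plain sequential wrap-around.
    return idx % (max_start + 1)
```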
5. Instance-Level Normalization
Each window is independently normalized:
x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
x = (x - x_mean) / (x_std + 1e-5)
x = np.clip(x, -clip, clip)
- Per-column (per-feature) mean and standard deviation are computed over the window
- Z-score normalization with epsilon 1e-5 for numerical stability
- Clipping to [-clip, clip] (default 5.0) to bound outliers
This instance normalization is critical for financial time series because different time windows can have vastly different price scales and volatility regimes.
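The three normalization lines above can be wrapped into a reusable helper; a sketch (the function name is illustrative):

```python
import numpy as np

def normalize_window(x: np.ndarray, clip: float = 5.0) -> np.ndarray:
    """Per-window, per-column z-score normalization with outlier clipping."""
    x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
    x = (x - x_mean) / (x_std + 1e-5)  # epsilon avoids division by zero
    return np.clip(x, -clip, clip)     # bound extreme z-scores
```

Because statistics are recomputed per window, a window from a low-price, low-volatility regime and one from a high-price regime are mapped onto the same scale before reaching the model.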
Features
The six value features extracted from CSV are:
open, high, low, close, volume, amount
The five temporal features are:
minute, hour, weekday, day, month
See Also
- Implementation:Shiyu_coder_Kronos_CustomKlineDataset_Usage -- API documentation for CustomKlineDataset
- Principle:Shiyu_coder_Kronos_CSV_Finetuning_Configuration -- Configuration that drives dataset parameters