Implementation:Shiyu_coder_Kronos_CustomKlineDataset_Usage
| Field | Value |
|---|---|
| Implementation Name | CustomKlineDataset_Usage |
| Repository | Shiyu_coder_Kronos |
| Repository URL | https://github.com/shiyu-coder/Kronos |
| Type | API Doc |
| Source File | finetune_csv/finetune_base_model.py |
| Lines | L25-132 |
| Class | CustomKlineDataset(Dataset) |
| Implements Principle | Principle:Shiyu_coder_Kronos_CSV_Dataset_Handling |
| Dependencies | pandas, torch, numpy |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
CustomKlineDataset is a PyTorch Dataset subclass that loads custom CSV financial data (OHLCV + amount), generates temporal features, performs time-based train/val/test splitting, and provides normalized sliding windows for Kronos finetuning.
API
from finetune_base_model import CustomKlineDataset
dataset = CustomKlineDataset(
    data_path,
    data_type='train',
    lookback_window=90,
    predict_window=10,
    clip=5.0,
    seed=100,
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15
)
Import
from finetune_base_model import CustomKlineDataset
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str | (required) | Path to CSV file with columns [timestamps, open, high, low, close, volume, amount] |
| data_type | str | 'train' | Split type: 'train', 'val', or 'test' |
| lookback_window | int | 90 | Number of historical time steps in each window |
| predict_window | int | 10 | Number of future time steps in each window |
| clip | float | 5.0 | Clipping threshold for z-score normalized values |
| seed | int | 100 | Random seed for deterministic shuffling |
| train_ratio | float | 0.7 | Fraction of data allocated to training |
| val_ratio | float | 0.15 | Fraction of data allocated to validation |
| test_ratio | float | 0.15 | Fraction of data allocated to testing |
Input Format
CSV file must contain the following columns:
| Column | Type | Description |
|---|---|---|
| timestamps | datetime-parseable string | Timestamp for each data point |
| open | float | Opening price |
| high | float | High price |
| low | float | Low price |
| close | float | Closing price |
| volume | float | Trading volume |
| amount | float | Trading amount |
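A minimal CSV matching this schema can be generated with pandas; the column names and order follow the table above, while the numeric values are purely illustrative:

```python
import pandas as pd

# Build a tiny CSV with the required columns (values are illustrative only).
rows = pd.DataFrame({
    "timestamps": pd.date_range("2024-01-01 09:30", periods=4, freq="1min"),
    "open":   [100.0, 100.5, 101.0, 100.8],
    "high":   [100.6, 101.2, 101.5, 101.0],
    "low":    [99.8, 100.3, 100.7, 100.4],
    "close":  [100.5, 101.0, 100.8, 100.9],
    "volume": [1200.0, 980.0, 1500.0, 1100.0],
    "amount": [120600.0, 98700.0, 151200.0, 110900.0],
})
rows.to_csv("kline_sample.csv", index=False)
```

Note that a real dataset needs at least `lookback_window + predict_window + 1` rows per split for a single window to be extractable.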
Output
The __getitem__ method returns a tuple of two tensors:
| Tensor | Shape | Description |
|---|---|---|
| x_tensor | (window, 6) | Instance-normalized OHLCV + amount features, where window = lookback_window + predict_window + 1 |
| x_stamp_tensor | (window, 5) | Temporal features: [minute, hour, weekday, day, month] |
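With the default constructor arguments, the window length works out as follows:

```python
# Window length for the default constructor arguments.
lookback_window, predict_window = 90, 10
window = lookback_window + predict_window + 1  # context + horizon + 1
print(window)
```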
Internal Processing Pipeline
Data Loading (_load_and_preprocess_data)
df = pd.read_csv(self.data_path)
df['timestamps'] = pd.to_datetime(df['timestamps'])
df = df.sort_values('timestamps').reset_index(drop=True)
# Generate temporal features
df['minute'] = df['timestamps'].dt.minute
df['hour'] = df['timestamps'].dt.hour
df['weekday'] = df['timestamps'].dt.weekday
df['day'] = df['timestamps'].dt.day
df['month'] = df['timestamps'].dt.month
Missing values are handled with forward fill if detected.
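The temporal-feature step can be reproduced in isolation. The forward-fill fallback is sketched here with pandas `ffill`; the source only describes the strategy as forward fill, so the exact call is an assumption:

```python
import pandas as pd

# Toy frame with one missing close price (illustrative).
df = pd.DataFrame({
    "timestamps": ["2024-01-02 10:30", "2024-01-02 10:31", "2024-01-02 10:32"],
    "close": [100.0, None, 101.0],
})
df["timestamps"] = pd.to_datetime(df["timestamps"])
df = df.sort_values("timestamps").reset_index(drop=True)

# Forward-fill missing values (assumed to mirror the source's handling).
df = df.ffill()

# Derive the five temporal features used by the dataset.
df["minute"] = df["timestamps"].dt.minute
df["hour"] = df["timestamps"].dt.hour
df["weekday"] = df["timestamps"].dt.weekday
df["day"] = df["timestamps"].dt.day
df["month"] = df["timestamps"].dt.month
print(df[["minute", "hour", "weekday", "day", "month"]].iloc[0].tolist())
```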
Time-Based Splitting (_split_data_by_time)
train_end = int(total_length * self.train_ratio)
val_end = int(total_length * (self.train_ratio + self.val_ratio))
# train: data[:train_end]
# val: data[train_end:val_end]
# test: data[val_end:]
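Because the boundaries are computed from cumulative ratios on the time-sorted data, the split is strictly chronological. For the default ratios (0.7/0.15/0.15) the boundaries fall out as:

```python
def split_bounds(total_length, train_ratio=0.7, val_ratio=0.15):
    # Chronological split: no shuffling across the boundaries.
    train_end = int(total_length * train_ratio)
    val_end = int(total_length * (train_ratio + val_ratio))
    return train_end, val_end

train_end, val_end = split_bounds(1000)
print(train_end, val_end)
```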
Deterministic Shuffling (__getitem__)
For training data, sample starting positions are computed via a deterministic hash:
epoch = getattr(self, 'current_epoch', 0)
start_idx = (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)
For validation/test data, sequential access is used:
start_idx = idx % (max_start + 1)
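The multipliers 9973 and 104729 are primes, so for a fixed epoch the mapping from idx to start_idx visits distinct positions whenever max_start + 1 is coprime with 9973. A quick standalone check of the mapping:

```python
def train_start_idx(idx, epoch, max_start):
    # Prime-multiplier hash used for deterministic pseudo-shuffling.
    return (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)

# Same idx and epoch always yield the same start position...
assert train_start_idx(5, 0, 999) == train_start_idx(5, 0, 999)

# ...while advancing the epoch changes the sampling pattern.
starts_e0 = [train_start_idx(i, 0, 999) for i in range(4)]
starts_e1 = [train_start_idx(i, 1, 999) for i in range(4)]
print(starts_e0, starts_e1)
```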
Instance Normalization
x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
x = (x - x_mean) / (x_std + 1e-5)
x = np.clip(x, -self.clip, self.clip)
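The normalization is per-window ("instance" normalization): mean and standard deviation are computed over the time axis of each window independently, with a small epsilon to avoid division by zero. In isolation:

```python
import numpy as np

def normalize_window(x, clip=5.0):
    # Per-window z-score normalization with epsilon and clipping,
    # matching the snippet above.
    x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
    x = (x - x_mean) / (x_std + 1e-5)
    return np.clip(x, -clip, clip)

# Toy window with 3 time steps and 2 features.
window = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
out = normalize_window(window)
print(out.round(4))
```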
Epoch Seed Control
The set_epoch_seed method updates the current epoch for deterministic shuffling:
dataset.set_epoch_seed(epoch * 10000)
This is called by the training loop at the start of each epoch to ensure different sampling patterns.
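The calling pattern can be sketched with a stand-in class exposing the same interface (the stand-in is hypothetical; only the `set_epoch_seed` call site mirrors the source):

```python
class TinyDataset:
    """Hypothetical stand-in with the same set_epoch_seed interface."""
    def __init__(self):
        self.current_epoch = 0

    def set_epoch_seed(self, seed):
        # Stores the value used by the deterministic index hash.
        self.current_epoch = seed

dataset = TinyDataset()
for epoch in range(3):
    # Refresh the deterministic shuffling pattern before each epoch.
    dataset.set_epoch_seed(epoch * 10000)
print(dataset.current_epoch)
```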
Usage Example
from finetune_base_model import CustomKlineDataset
from torch.utils.data import DataLoader
# Create train dataset
train_dataset = CustomKlineDataset(
    data_path="/path/to/kline_data.csv",
    data_type='train',
    lookback_window=512,
    predict_window=48,
    clip=5.0,
    seed=42,
    train_ratio=0.9,
    val_ratio=0.1,
    test_ratio=0.0
)
# Create DataLoader
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=6,
    pin_memory=True,
    drop_last=True
)
# Iterate
for x_tensor, x_stamp_tensor in train_loader:
    # x_tensor: (batch, window, 6)
    # x_stamp_tensor: (batch, window, 5)
    pass
See Also
- Principle:Shiyu_coder_Kronos_CSV_Dataset_Handling -- The principle this implementation realizes
- Implementation:Shiyu_coder_Kronos_CustomFinetuneConfig_Init -- Configuration that provides dataset parameters