
Implementation:Shiyu coder Kronos CustomKlineDataset Usage

From Leeroopedia


| Field | Value |
|---|---|
| Implementation Name | CustomKlineDataset_Usage |
| Repository | Shiyu_coder_Kronos |
| Repository URL | https://github.com/shiyu-coder/Kronos |
| Type | API Doc |
| Source File | finetune_csv/finetune_base_model.py |
| Lines | L25-132 |
| Class | CustomKlineDataset(Dataset) |
| Implements Principle | Principle:Shiyu_coder_Kronos_CSV_Dataset_Handling |
| Dependencies | pandas, torch, numpy |
| Last Updated | 2026-02-09 14:00 GMT |

Overview

CustomKlineDataset is a PyTorch Dataset subclass that loads custom CSV financial data (OHLCV + amount), generates temporal features, performs time-based train/val/test splitting, and provides normalized sliding windows for Kronos finetuning.

API

```python
from finetune_base_model import CustomKlineDataset

dataset = CustomKlineDataset(
    data_path,                # required: path to the CSV file
    data_type='train',
    lookback_window=90,
    predict_window=10,
    clip=5.0,
    seed=100,
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15
)
```

Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str | (required) | Path to CSV file with columns [timestamps, open, high, low, close, volume, amount] |
| data_type | str | 'train' | Split type: 'train', 'val', or 'test' |
| lookback_window | int | 90 | Number of historical time steps in each window |
| predict_window | int | 10 | Number of future time steps in each window |
| clip | float | 5.0 | Clipping threshold for z-score normalized values |
| seed | int | 100 | Random seed for deterministic shuffling |
| train_ratio | float | 0.7 | Fraction of data allocated to training |
| val_ratio | float | 0.15 | Fraction of data allocated to validation |
| test_ratio | float | 0.15 | Fraction of data allocated to testing |

Input Format

CSV file must contain the following columns:

| Column | Type | Description |
|---|---|---|
| timestamps | datetime-parseable string | Timestamp for each data point |
| open | float | Opening price |
| high | float | High price |
| low | float | Low price |
| close | float | Closing price |
| volume | float | Trading volume |
| amount | float | Trading amount |
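A minimal way to produce a CSV in this format (column names taken from the table above; the values themselves are illustrative only):

```python
import pandas as pd

# Build a tiny synthetic K-line frame with the required columns.
rows = pd.DataFrame({
    "timestamps": pd.date_range("2024-01-01", periods=4, freq="h"),
    "open":   [100.0, 101.0, 102.0, 101.5],
    "high":   [101.5, 102.0, 103.0, 102.0],
    "low":    [99.5, 100.5, 101.0, 100.8],
    "close":  [101.0, 102.0, 101.5, 101.9],
    "volume": [1200.0, 1500.0, 900.0, 1100.0],
    "amount": [1.2e5, 1.5e5, 9.1e4, 1.1e5],
})
rows.to_csv("kline_sample.csv", index=False)
```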

Output

The __getitem__ method returns a tuple of two tensors:

| Tensor | Shape | Description |
|---|---|---|
| x_tensor | (window, 6) | Instance-normalized OHLCV + amount features, where window = lookback_window + predict_window + 1 |
| x_stamp_tensor | (window, 5) | Temporal features: [minute, hour, weekday, day, month] |
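With the default constructor values, each sample therefore spans 90 + 10 + 1 = 101 time steps. A quick shape check with placeholder arrays (numpy is used here instead of torch purely for brevity):

```python
import numpy as np

lookback_window, predict_window = 90, 10
window = lookback_window + predict_window + 1  # 101 steps per sample

# Placeholder arrays with the documented per-sample shapes.
x = np.zeros((window, 6))        # OHLCV + amount
x_stamp = np.zeros((window, 5))  # minute, hour, weekday, day, month
```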

Internal Processing Pipeline

Data Loading (_load_and_preprocess_data)

```python
df = pd.read_csv(self.data_path)
df['timestamps'] = pd.to_datetime(df['timestamps'])
df = df.sort_values('timestamps').reset_index(drop=True)

# Generate temporal features
df['minute'] = df['timestamps'].dt.minute
df['hour'] = df['timestamps'].dt.hour
df['weekday'] = df['timestamps'].dt.weekday
df['day'] = df['timestamps'].dt.day
df['month'] = df['timestamps'].dt.month
```

Missing values are handled with forward fill if detected.
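That missing-value handling can be sketched as follows (the exact detection condition in the source may differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"close": [100.0, np.nan, 101.0]})

# Forward-fill only when missing values are actually present.
if df.isnull().any().any():
    df = df.ffill()
```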

Time-Based Splitting (_split_data_by_time)

```python
train_end = int(total_length * self.train_ratio)
val_end = int(total_length * (self.train_ratio + self.val_ratio))

# train: data[:train_end]
# val:   data[train_end:val_end]
# test:  data[val_end:]
```
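Plugging in concrete numbers for a hypothetical 1000-row file with the default ratios:

```python
total_length = 1000  # hypothetical number of rows
train_ratio, val_ratio = 0.7, 0.15

train_end = int(total_length * train_ratio)              # 700
val_end = int(total_length * (train_ratio + val_ratio))  # 850

# train: rows [0, 700), val: rows [700, 850), test: rows [850, 1000)
```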

Deterministic Shuffling (__getitem__)

For training data, sample starting positions are computed via a deterministic hash:

```python
epoch = getattr(self, 'current_epoch', 0)
start_idx = (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)
```

For validation/test data, sequential access is used:

```python
start_idx = idx % (max_start + 1)
```
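A standalone check of this index arithmetic, using a hypothetical `max_start` and sample index, shows that the hash keeps every start index in range while remapping positions each epoch:

```python
max_start = 499  # hypothetical: number of valid start positions minus one
idx = 7          # hypothetical sample index

# Training: epoch-dependent deterministic hash (9973 and 104729 are primes).
starts = [(idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)
          for epoch in range(3)]

# Validation/test: plain sequential access.
val_start = idx % (max_start + 1)
```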

Instance Normalization

```python
x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
x = (x - x_mean) / (x_std + 1e-5)
x = np.clip(x, -self.clip, self.clip)
```
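Run against random data (with `self.clip` replaced by a local variable), this normalization yields per-column means near zero and values bounded by the clip threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=(101, 6))  # dummy window
clip = 5.0

# Per-window (instance) z-score normalization, then clipping.
x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
x = (x - x_mean) / (x_std + 1e-5)
x = np.clip(x, -clip, clip)
```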

Epoch Seed Control

The set_epoch_seed method updates the current epoch for deterministic shuffling:

```python
dataset.set_epoch_seed(epoch * 10000)
```

The training loop calls this at the start of each epoch so that successive epochs sample different window positions while remaining fully reproducible.

Usage Example

```python
from finetune_base_model import CustomKlineDataset
from torch.utils.data import DataLoader

# Create train dataset
train_dataset = CustomKlineDataset(
    data_path="/path/to/kline_data.csv",
    data_type='train',
    lookback_window=512,
    predict_window=48,
    clip=5.0,
    seed=42,
    train_ratio=0.9,
    val_ratio=0.1,
    test_ratio=0.0
)

# Create DataLoader
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=6,
    pin_memory=True,
    drop_last=True
)

# Iterate
for x_tensor, x_stamp_tensor in train_loader:
    # x_tensor:       (batch, window, 6)
    # x_stamp_tensor: (batch, window, 5)
    pass
```
