Implementation:Shiyu_coder_Kronos_CustomKlineDataset_Usage
| Field | Value |
|---|---|
| Implementation Name | CustomKlineDataset_Usage |
| Repository | Shiyu_coder_Kronos |
| Repository URL | https://github.com/shiyu-coder/Kronos |
| Type | API Doc |
| Source File | finetune_csv/finetune_base_model.py |
| Lines | L25-132 |
| Class | CustomKlineDataset(Dataset) |
| Implements Principle | Principle:Shiyu_coder_Kronos_CSV_Dataset_Handling |
| Dependencies | pandas, torch, numpy |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
CustomKlineDataset is a PyTorch Dataset subclass that loads custom CSV financial data (OHLCV + amount), generates temporal features, performs time-based train/val/test splitting, and provides normalized sliding windows for Kronos finetuning.
API
from finetune_base_model import CustomKlineDataset
dataset = CustomKlineDataset(
    data_path,
    data_type='train',
    lookback_window=90,
    predict_window=10,
    clip=5.0,
    seed=100,
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15
)
Import
from finetune_base_model import CustomKlineDataset
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str | (required) | Path to CSV file with columns [timestamps, open, high, low, close, volume, amount] |
| data_type | str | 'train' | Split type: 'train', 'val', or 'test' |
| lookback_window | int | 90 | Number of historical time steps in each window |
| predict_window | int | 10 | Number of future time steps in each window |
| clip | float | 5.0 | Clipping threshold for z-score normalized values |
| seed | int | 100 | Random seed for deterministic shuffling |
| train_ratio | float | 0.7 | Fraction of data allocated to training |
| val_ratio | float | 0.15 | Fraction of data allocated to validation |
| test_ratio | float | 0.15 | Fraction of data allocated to testing |
Input Format
CSV file must contain the following columns:
| Column | Type | Description |
|---|---|---|
| timestamps | datetime-parseable string | Timestamp for each data point |
| open | float | Opening price |
| high | float | High price |
| low | float | Low price |
| close | float | Closing price |
| volume | float | Trading volume |
| amount | float | Trading amount |
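A minimal CSV matching this schema can be generated with pandas; the column names and order follow the table above, while the numeric values are purely illustrative:

```python
import pandas as pd

# Build a tiny CSV with the required columns (values are illustrative only).
rows = pd.DataFrame({
    "timestamps": pd.date_range("2024-01-01 09:30", periods=4, freq="1min"),
    "open":   [100.0, 100.5, 101.0, 100.8],
    "high":   [100.6, 101.2, 101.5, 101.0],
    "low":    [99.8, 100.3, 100.7, 100.4],
    "close":  [100.5, 101.0, 100.8, 100.9],
    "volume": [1200.0, 980.0, 1500.0, 1100.0],
    "amount": [120600.0, 98700.0, 151200.0, 110900.0],
})
rows.to_csv("kline_sample.csv", index=False)
```

Note that a real dataset needs at least `lookback_window + predict_window + 1` rows per split for a single window to be extractable.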
Output
The __getitem__ method returns a tuple of two tensors:
| Tensor | Shape | Description |
|---|---|---|
| x_tensor | (window, 6) | Instance-normalized OHLCV + amount features, where window = lookback_window + predict_window + 1 |
| x_stamp_tensor | (window, 5) | Temporal features: [minute, hour, weekday, day, month] |
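With the default constructor arguments, the window length works out as follows:

```python
# Window length for the default constructor arguments.
lookback_window, predict_window = 90, 10
window = lookback_window + predict_window + 1  # context + horizon + 1
print(window)
```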
Internal Processing Pipeline
Data Loading (_load_and_preprocess_data)
df = pd.read_csv(self.data_path)
df['timestamps'] = pd.to_datetime(df['timestamps'])
df = df.sort_values('timestamps').reset_index(drop=True)
# Generate temporal features
df['minute'] = df['timestamps'].dt.minute
df['hour'] = df['timestamps'].dt.hour
df['weekday'] = df['timestamps'].dt.weekday
df['day'] = df['timestamps'].dt.day
df['month'] = df['timestamps'].dt.month
Missing values are handled with forward fill if detected.
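The temporal-feature step can be reproduced in isolation. The forward-fill fallback is sketched here with pandas `ffill`; the source only describes the strategy as forward fill, so the exact call is an assumption:

```python
import pandas as pd

# Toy frame with one missing close price (illustrative).
df = pd.DataFrame({
    "timestamps": ["2024-01-02 10:30", "2024-01-02 10:31", "2024-01-02 10:32"],
    "close": [100.0, None, 101.0],
})
df["timestamps"] = pd.to_datetime(df["timestamps"])
df = df.sort_values("timestamps").reset_index(drop=True)

# Forward-fill missing values (assumed to mirror the source's handling).
df = df.ffill()

# Derive the five temporal features used by the dataset.
df["minute"] = df["timestamps"].dt.minute
df["hour"] = df["timestamps"].dt.hour
df["weekday"] = df["timestamps"].dt.weekday
df["day"] = df["timestamps"].dt.day
df["month"] = df["timestamps"].dt.month
print(df[["minute", "hour", "weekday", "day", "month"]].iloc[0].tolist())
```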
Time-Based Splitting (_split_data_by_time)
train_end = int(total_length * self.train_ratio)
val_end = int(total_length * (self.train_ratio + self.val_ratio))
# train: data[:train_end]
# val: data[train_end:val_end]
# test: data[val_end:]
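Because the boundaries are computed from cumulative ratios on the time-sorted data, the split is strictly chronological. For the default ratios (0.7/0.15/0.15) the boundaries fall out as:

```python
def split_bounds(total_length, train_ratio=0.7, val_ratio=0.15):
    # Chronological split: no shuffling across the boundaries.
    train_end = int(total_length * train_ratio)
    val_end = int(total_length * (train_ratio + val_ratio))
    return train_end, val_end

train_end, val_end = split_bounds(1000)
print(train_end, val_end)
```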
Deterministic Shuffling (__getitem__)
For training data, sample starting positions are computed via a deterministic hash:
epoch = getattr(self, 'current_epoch', 0)
start_idx = (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)
For validation/test data, sequential access is used:
start_idx = idx % (max_start + 1)
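The multipliers 9973 and 104729 are primes, so for a fixed epoch the mapping from idx to start_idx visits distinct positions whenever max_start + 1 is coprime with 9973. A quick standalone check of the mapping:

```python
def train_start_idx(idx, epoch, max_start):
    # Prime-multiplier hash used for deterministic pseudo-shuffling.
    return (idx * 9973 + (epoch + 1) * 104729) % (max_start + 1)

# Same idx and epoch always yield the same start position...
assert train_start_idx(5, 0, 999) == train_start_idx(5, 0, 999)

# ...while advancing the epoch changes the sampling pattern.
starts_e0 = [train_start_idx(i, 0, 999) for i in range(4)]
starts_e1 = [train_start_idx(i, 1, 999) for i in range(4)]
print(starts_e0, starts_e1)
```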
Instance Normalization
x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
x = (x - x_mean) / (x_std + 1e-5)
x = np.clip(x, -self.clip, self.clip)
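The normalization is per-window ("instance" normalization): mean and standard deviation are computed over the time axis of each window independently, with a small epsilon to avoid division by zero. In isolation:

```python
import numpy as np

def normalize_window(x, clip=5.0):
    # Per-window z-score normalization with epsilon and clipping,
    # matching the snippet above.
    x_mean, x_std = np.mean(x, axis=0), np.std(x, axis=0)
    x = (x - x_mean) / (x_std + 1e-5)
    return np.clip(x, -clip, clip)

# Toy window with 3 time steps and 2 features.
window = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
out = normalize_window(window)
print(out.round(4))
```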
Epoch Seed Control
The set_epoch_seed method updates the current epoch for deterministic shuffling:
dataset.set_epoch_seed(epoch * 10000)
This is called by the training loop at the start of each epoch to ensure different sampling patterns.
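The calling pattern can be sketched with a stand-in class exposing the same interface (the stand-in is hypothetical; only the `set_epoch_seed` call site mirrors the source):

```python
class TinyDataset:
    """Hypothetical stand-in with the same set_epoch_seed interface."""
    def __init__(self):
        self.current_epoch = 0

    def set_epoch_seed(self, seed):
        # Stores the value used by the deterministic index hash.
        self.current_epoch = seed

dataset = TinyDataset()
for epoch in range(3):
    # Refresh the deterministic shuffling pattern before each epoch.
    dataset.set_epoch_seed(epoch * 10000)
print(dataset.current_epoch)
```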
Usage Example
from finetune_base_model import CustomKlineDataset
from torch.utils.data import DataLoader
# Create train dataset
train_dataset = CustomKlineDataset(
    data_path="/path/to/kline_data.csv",
    data_type='train',
    lookback_window=512,
    predict_window=48,
    clip=5.0,
    seed=42,
    train_ratio=0.9,
    val_ratio=0.1,
    test_ratio=0.0
)
# Create DataLoader
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=6,
    pin_memory=True,
    drop_last=True
)
# Iterate
for x_tensor, x_stamp_tensor in train_loader:
    # x_tensor: (batch, window, 6)
    # x_stamp_tensor: (batch, window, 5)
    pass
See Also
- Principle:Shiyu_coder_Kronos_CSV_Dataset_Handling -- The principle this implementation realizes
- Implementation:Shiyu_coder_Kronos_CustomFinetuneConfig_Init -- Configuration that provides dataset parameters