Implementation:Online ml River Datasets SMTP
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Datasets, Binary_Classification, Anomaly_Detection |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete dataset for binary classification provided by the River library.
Description
SMTP dataset from the KDD Cup 1999. The goal is to predict whether an SMTP connection is anomalous. The dataset contains only 2,211 (0.4%) positive labels, making it highly imbalanced and suitable for anomaly detection tasks.
This dataset contains 95,156 samples with 3 features for binary classification tasks.
Usage
This dataset is useful for:
- Network anomaly detection
- Imbalanced classification problems
- SMTP protocol security analysis
- Evaluating classifiers on highly skewed data
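Why skewed data needs care can be shown with a small sketch. It uses synthetic labels with a 0.4% positive rate rather than the real dataset, and the helper `accuracy_and_recall` is a hypothetical name introduced here for illustration. A baseline that always predicts "normal" scores very high accuracy while detecting no anomalies at all.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall of the positive class) for binary labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(t == 1 for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


# Synthetic labels (assumption): 4 anomalies in 1,000 samples, i.e. 0.4%.
y_true = [1 if i % 250 == 0 else 0 for i in range(1000)]

# A trivial baseline that always predicts the majority class ("normal").
y_pred = [0] * len(y_true)

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

For this reason, metrics that account for the positive class (recall, precision, or ranking metrics such as ROC AUC) are usually more informative than accuracy on data like this.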
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/datasets/smtp.py
Signature
class SMTP(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=95_156,
            n_features=3,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/smtp.zip",
            size=5_484_982,
            filename="smtp.csv",
        )

    def _iter(self):
        return stream.iter_csv(
            self.path,
            target="service",
            converters={
                "duration": float,
                "src_bytes": float,
                "dst_bytes": float,
                "service": int,
            },
        )
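The `_iter` method above passes `stream.iter_csv` a target column and one converter per column. The sketch below is a stand-in built on the standard `csv` module, not River's implementation; `iter_rows` and the sample rows are hypothetical. It illustrates the effect of those two arguments: every value is cast from a string to its declared type, and the target column is split off from the feature dictionary.

```python
import csv
import io

# Made-up rows in the column layout described above; not real SMTP data.
SAMPLE_CSV = """duration,src_bytes,dst_bytes,service
0.5,120.0,300.0,0
1.25,80.0,0.0,1
"""

CONVERTERS = {
    "duration": float,
    "src_bytes": float,
    "dst_bytes": float,
    "service": int,
}


def iter_rows(buffer, target, converters):
    """Yield (features, label) pairs from CSV text, casting each column."""
    for row in csv.DictReader(buffer):
        # The csv module yields strings, so apply the per-column converter.
        typed = {name: converters[name](value) for name, value in row.items()}
        # Separate the target column from the remaining feature columns.
        label = typed.pop(target)
        yield typed, label


pairs = list(iter_rows(io.StringIO(SAMPLE_CSV), "service", CONVERTERS))
for x, y in pairs:
    print(x, y)
```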
Import
from river import datasets
dataset = datasets.SMTP()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | — | — | No parameters needed |
Outputs
| Name | Type | Description |
|---|---|---|
| iter() | tuple(dict, int) | Yields (features_dict, target) pairs where target indicates anomalous connections |
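A stream of `(features_dict, target)` pairs lends itself to a score-then-learn loop, in which each sample is scored before the model is updated with it. The sketch below assumes a hypothetical `RunningZScore` scorer (not part of River) and a handful of made-up pairs in the same shape as the output described above. With River installed and the data downloaded, `datasets.SMTP()` could presumably be used in place of the list.

```python
import math


class RunningZScore:
    """Hypothetical scorer: absolute z-score of one feature under running statistics."""

    def __init__(self, feature):
        self.feature = feature
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def score_one(self, x):
        if self.n < 2:
            return 0.0  # not enough history to estimate a spread
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x[self.feature] - self.mean) / std if std > 0 else 0.0

    def learn_one(self, x):
        value = x[self.feature]
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


# Made-up (features, label) pairs; only the last one is labelled anomalous.
stream = [
    ({"duration": 1.0, "src_bytes": 100.0, "dst_bytes": 300.0}, 0),
    ({"duration": 1.1, "src_bytes": 102.0, "dst_bytes": 310.0}, 0),
    ({"duration": 0.9, "src_bytes": 98.0, "dst_bytes": 290.0}, 0),
    ({"duration": 1.0, "src_bytes": 101.0, "dst_bytes": 305.0}, 0),
    ({"duration": 1.2, "src_bytes": 99.0, "dst_bytes": 295.0}, 0),
    ({"duration": 9.0, "src_bytes": 5000.0, "dst_bytes": 0.0}, 1),
]

model = RunningZScore("src_bytes")
scores, labels = [], []
for x, y in stream:
    scores.append(model.score_one(x))  # score the sample first
    model.learn_one(x)                 # then update the running statistics
    labels.append(y)                   # the label is kept only for evaluation

for score, label in zip(scores, labels):
    print(f"score={score:.2f} label={label}")
```

Scoring before learning means each sample is evaluated by a model that has not yet seen it, which mirrors how a detector would behave on live traffic.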
Dataset Properties
| Property | Value |
|---|---|
| Number of samples | 95,156 |
| Number of features | 3 |
| Task | Binary classification (anomaly detection) |
| Format | CSV (compressed) |
| Size | 5,484,982 bytes |
| Class imbalance | Only 0.4% positive labels (2,211 anomalies) |
Features
The CSV file contains three feature columns and one target column:
- duration: Duration of the SMTP connection (float)
- src_bytes: Number of bytes from source to destination (float)
- dst_bytes: Number of bytes from destination to source (float)
- service: Target variable indicating whether the connection is anomalous (integer); this is the label, not one of the 3 features
Usage Examples
from river import datasets

dataset = datasets.SMTP()

for x, y in dataset:
    print(x, y)
    break