Implementation:Online ml River Datasets SMTP

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Datasets, Binary_Classification, Anomaly_Detection
Last Updated: 2026-02-08 16:00 GMT

Overview

Concrete dataset for binary classification provided by the River library.

Description

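SMTP dataset from the KDD Cup 1999 data. The goal is to predict whether an SMTP connection is anomalous. The dataset is described as containing only 2,211 (0.4%) positive labels, although 2,211 out of 95,156 samples would be about 2.3%, so the exact positive rate should be verified against the data. In either case the dataset is highly imbalanced and suitable for anomaly detection tasks.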

This dataset contains 95,156 samples with 3 features for binary classification tasks.
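
The class balance can be checked directly by counting labels. The following is a minimal sketch, assuming only that the dataset yields (features, label) pairs; the helper name `positive_rate` and the toy records are illustrative and not part of River.

```python
from collections import Counter


def positive_rate(stream):
    """Return (n_positive, n_total, rate) for an iterable of (x, y) pairs."""
    counts = Counter(int(y) for _, y in stream)
    total = sum(counts.values())
    rate = counts[1] / total if total else 0.0
    return counts[1], total, rate


# Made-up toy records in the same (features, label) shape; not real SMTP data.
toy_stream = [
    ({"duration": 0.1, "src_bytes": 7.0, "dst_bytes": 5.8}, 0),
    ({"duration": 0.2, "src_bytes": 7.1, "dst_bytes": 5.9}, 0),
    ({"duration": 0.1, "src_bytes": 6.9, "dst_bytes": 5.7}, 0),
    ({"duration": 3.4, "src_bytes": 12.0, "dst_bytes": 0.0}, 1),
]

n_pos, n_total, rate = positive_rate(toy_stream)
print(f"{n_pos} positives out of {n_total} ({rate:.1%})")
```

Passing `datasets.SMTP()` in place of `toy_stream` should count labels on the real data, at the cost of downloading the file on first use.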

Usage

This dataset is useful for:

  • Network anomaly detection
  • Imbalanced classification problems
  • SMTP protocol security analysis
  • Evaluating classifiers on highly skewed data
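
To illustrate why skewed data needs care during evaluation, the sketch below scores a trivial predictor that never flags an anomaly. The labels are synthetic and the function name `accuracy_and_recall` is hypothetical; the point is only that accuracy can look excellent while recall on the anomalous class is zero.

```python
def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and positive-class recall for two label sequences."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == 1 and p == 1 for t, p in pairs)
    positives = sum(t == 1 for t, _ in pairs)
    accuracy = correct / len(pairs)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


# Synthetic labels: 4 anomalies among 1,000 connections (made-up numbers).
y_true = [1] * 4 + [0] * 996
y_pred = [0] * 1000  # a "classifier" that never flags anything

acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={acc:.3f}, recall={rec:.3f}")
```

For this reason, metrics that focus on the rare class (recall, precision, ROC AUC) are usually more informative than accuracy on datasets like this one.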

Code Reference

Source Location

Signature

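# Imports needed to use this class outside the River package itself.
from river import stream
from river.datasets import base


class SMTP(base.RemoteDataset):
    def __init__(self):
        # Metadata describing the remote archive and the extracted CSV file.
        super().__init__(
            n_samples=95_156,
            n_features=3,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/smtp.zip",
            size=5_484_982,
            filename="smtp.csv",
        )

    def _iter(self):
        # Stream rows lazily; the "service" column is returned as the target.
        return stream.iter_csv(
            self.path,
            target="service",
            converters={
                "duration": float,
                "src_bytes": float,
                "dst_bytes": float,
                "service": int,
            },
        )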
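
The effect of the `target` and `converters` arguments can be sketched with the standard library alone. This is a simplified stand-in written for illustration, not River's `stream.iter_csv` implementation; the function name `iter_csv_sketch` and the two sample rows are made up.

```python
import csv
import io


def iter_csv_sketch(buffer, target, converters):
    """Yield (features, label) pairs from CSV text, casting the listed columns."""
    for row in csv.DictReader(buffer):
        # Every CSV value starts as a string; apply the per-column casts.
        for name, cast in converters.items():
            row[name] = cast(row[name])
        # Split the target column off from the feature dictionary.
        y = row.pop(target)
        yield row, y


# Two made-up rows using the column names from the converters above.
sample = io.StringIO(
    "duration,src_bytes,dst_bytes,service\n"
    "0.5,7.2,5.9,0\n"
    "0.1,6.8,5.7,1\n"
)

rows = list(
    iter_csv_sketch(
        sample,
        target="service",
        converters={
            "duration": float,
            "src_bytes": float,
            "dst_bytes": float,
            "service": int,
        },
    )
)
print(rows[0])
```

Each yielded pair has the same shape as the items produced when iterating over the dataset: a dictionary of three float features and an integer label.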

Import

from river import datasets
dataset = datasets.SMTP()

I/O Contract

Inputs

Name | Type | Required | Description
(none) | - | - | No parameters needed

Outputs

Name | Type | Description
iter() | tuple(dict, int) | Yields (features_dict, target) pairs where target indicates anomalous connections

Dataset Properties

Property | Value
Number of samples | 95,156
Number of features | 3
Task | Binary classification (anomaly detection)
Format | CSV (compressed)
Size | 5,484,982 bytes
Class imbalance | Only 0.4% positive labels (2,211 anomalies)

Features

The dataset includes the following columns (three features plus the target):

  • duration: Duration of the SMTP connection (float)
  • src_bytes: Number of bytes from source to destination (float)
  • dst_bytes: Number of bytes from destination to source (float)
  • service: Target variable indicating whether the connection is anomalous (integer); it is not counted among the 3 features

Usage Examples

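from river import datasets

# The data file is downloaded on first iteration if it is not already cached locally.
dataset = datasets.SMTP()

# Each item is a (features_dict, target) pair; print the first one and stop.
for x, y in dataset:
    print(x, y)
    break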
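
Beyond printing a sample, a typical online-learning workflow scores each observation before learning from it. The sketch below shows that test-then-train loop with a hypothetical toy scorer, `RunningZScore`, whose `score_one` and `learn_one` method names mirror the convention used by River's anomaly detectors. The scorer itself and the toy stream are assumptions made for illustration, not part of River.

```python
import math


class RunningZScore:
    """Hypothetical toy scorer: absolute z-score of one feature under running statistics."""

    def __init__(self, feature):
        self.feature = feature
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def score_one(self, x):
        if self.n < 2:
            return 0.0  # not enough history to estimate a spread
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x[self.feature] - self.mean) / std if std > 0 else 0.0

    def learn_one(self, x):
        value = x[self.feature]
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


def progressive_scores(stream, model):
    """Score each observation first, then update the model with it."""
    for x, y in stream:
        score = model.score_one(x)
        model.learn_one(x)
        yield score, y


# Made-up stream in the dataset's (features, label) shape; the last record is the anomaly.
toy_stream = [
    ({"src_bytes": 7.0}, 0),
    ({"src_bytes": 7.2}, 0),
    ({"src_bytes": 6.8}, 0),
    ({"src_bytes": 7.1}, 0),
    ({"src_bytes": 12.0}, 1),
]

results = list(progressive_scores(toy_stream, RunningZScore("src_bytes")))
for score, label in results:
    print(f"score={score:.2f} label={label}")
```

Replacing `toy_stream` with `datasets.SMTP()` should run the same loop over the real data, since the dataset yields pairs of the same shape.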
