Implementation:Online ml River Datasets HTTP

From Leeroopedia
Revision as of 16:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Online_ml_River_Datasets_HTTP.md)


Knowledge Sources
Domains Online_Learning, Datasets, Binary_Classification, Anomaly_Detection
Last Updated 2026-02-08 16:00 GMT

Overview

Concrete dataset for binary classification provided by the River library.

Description

HTTP dataset of the KDD Cup 1999. The goal is to predict whether an HTTP connection is anomalous. Only 2,211 samples (about 0.4%) carry a positive label, which makes the dataset highly imbalanced and well suited to anomaly detection tasks.

This dataset contains 567,498 samples with 3 features for binary classification tasks.
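
The imbalance quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two counts stated in this section; the variable names are illustrative.

```python
# Counts as stated in the description above.
n_samples = 567_498
n_positive = 2_211

# Share of anomalous connections in the whole dataset.
positive_rate = n_positive / n_samples

# How many normal connections there are for every anomalous one.
negatives_per_positive = (n_samples - n_positive) / n_positive

print(f"positive rate: {positive_rate:.2%}")
print(f"normal connections per anomaly: {negatives_per_positive:.1f}")
```

The positive rate rounds to the 0.4% quoted above, and there are roughly 256 normal connections for each anomaly.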

Usage

This dataset is useful for:

  • Anomaly detection and network intrusion detection
  • Handling imbalanced classification problems
  • Evaluating classifiers on highly skewed data distributions
  • Cybersecurity and network security applications

Code Reference

Source Location

Signature

from river import stream
from river.datasets import base


class HTTP(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=567_498,
            n_features=3,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/kdd99_http.zip",
            size=32_400_738,
            filename="kdd99_http.csv",
        )

    def _iter(self):
        converters = {
            "duration": float,
            "src_bytes": float,
            "dst_bytes": float,
            "service": int,
        }
        return stream.iter_csv(self.path, target="service", converters=converters)
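
As a rough illustration of what `_iter` produces, the following stdlib-only sketch casts each CSV column with a converter and splits off the target column. The helper `iter_rows` and the two sample rows are made up for this example; River's actual `stream.iter_csv` may differ in details, and the values and column order of the real file are not shown here.

```python
import csv
import io


def iter_rows(buffer, target, converters):
    """Yield (features, label) pairs from CSV text, casting each column.

    Hypothetical helper for illustration; not River's implementation.
    """
    for row in csv.DictReader(buffer):
        # Apply the converter registered for each column name.
        typed = {name: converters[name](value) for name, value in row.items()}
        # Remove the target column so only features remain in the dict.
        label = typed.pop(target)
        yield typed, label


# Made-up rows for illustration only; not taken from the real file.
sample = io.StringIO(
    "duration,src_bytes,dst_bytes,service\n"
    "-2.30,5.37,10.72,0\n"
    "-2.30,5.09,8.42,1\n"
)

converters = {
    "duration": float,
    "src_bytes": float,
    "dst_bytes": float,
    "service": int,
}

pairs = list(iter_rows(sample, target="service", converters=converters))
for features, label in pairs:
    print(features, label)
```

Each yielded pair holds a dict with the three float features and a separate integer label, which matches the `(features_dict, target)` shape described in the I/O contract.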

Import

from river import datasets
dataset = datasets.HTTP()

I/O Contract

Inputs

Name | Type | Required | Description
(none) | n/a | n/a | No parameters needed

Outputs

Name | Type | Description
__iter__() | tuple(dict, int) | Yields (features_dict, target) pairs, where target is the integer label marking anomalous connections

Dataset Properties

Property | Value
Number of samples | 567,498
Number of features | 3
Task | Binary classification (anomaly detection)
Format | CSV (compressed)
Size | 32,400,738 bytes
Class imbalance | About 0.4% positive labels (2,211 anomalies)

Features

The dataset has three feature columns and one target column:

  • duration: Duration of the connection (float)
  • src_bytes: Number of bytes from source to destination (float)
  • dst_bytes: Number of bytes from destination to source (float)
  • service: Target column, an integer label indicating whether the connection is anomalous; it is not counted among the 3 features

Usage Examples

from river import datasets

dataset = datasets.HTTP()
for x, y in dataset:
    print(x, y)
    break
