Implementation:Online ml River Datasets HTTP
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Datasets, Binary_Classification, Anomaly_Detection |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete dataset for binary classification provided by the River library.
Description
HTTP dataset from the KDD Cup 1999. The goal is to predict whether an HTTP connection is anomalous. Only 2,211 samples (0.4%) carry a positive label, so the dataset is highly imbalanced and well suited to anomaly detection tasks.
This dataset contains 567,498 samples with 3 features for binary classification tasks.
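To make the imbalance concrete, the short calculation below derives the positive rate from the two counts quoted above and shows what a trivial classifier that always predicts "not anomalous" would score. It relies only on the figures stated in this section; the variable names are illustrative.

```python
# Counts quoted in the description above.
n_samples = 567_498
n_positive = 2_211

# Share of anomalous connections in the stream.
positive_rate = n_positive / n_samples

# A classifier that always answers "not anomalous" is right on every
# negative sample and wrong on every positive one.
always_negative_accuracy = (n_samples - n_positive) / n_samples
always_negative_recall = 0 / n_positive  # it never flags an anomaly

print(f"positive rate: {positive_rate:.2%}")
print(f"accuracy of always predicting 0: {always_negative_accuracy:.2%}")
print(f"recall of always predicting 0: {always_negative_recall:.0%}")
```

Accuracy alone is therefore a poor yardstick on this dataset; precision, recall, or ROC AUC on the positive class say more about how a model handles the rare anomalies.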
Usage
This dataset is useful for:
- Anomaly detection and network intrusion detection
- Handling imbalanced classification problems
- Evaluating classifiers on highly skewed data distributions
- Cybersecurity and network security applications
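As a rough illustration of evaluating on skewed data, the sketch below runs a simple predict-and-score loop over a stream of (features, label) pairs and tracks precision and recall for the positive class instead of accuracy. Everything in it is an assumption made for illustration: `evaluate_stream`, the `threshold_rule` stand-in model, and the four toy records are hypothetical and are not part of River or of the dataset.

```python
def evaluate_stream(stream, predict):
    """Count confusion-matrix cells for the positive class over a stream."""
    tp = fp = fn = tn = 0
    for x, y in stream:
        y_pred = predict(x)
        if y_pred and y:
            tp += 1
        elif y_pred and not y:
            fp += 1
        elif not y_pred and y:
            fn += 1
        else:
            tn += 1
    # Guard against division by zero when a class is never predicted or seen.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


def threshold_rule(x, limit=5.0):
    """Hypothetical stand-in model: flag connections with large src_bytes."""
    return int(x["src_bytes"] > limit)


# Toy records with made-up values, shaped like (features_dict, target) pairs.
toy_stream = [
    ({"duration": 0.1, "src_bytes": 1.0, "dst_bytes": 2.0}, 0),
    ({"duration": 0.2, "src_bytes": 9.0, "dst_bytes": 1.0}, 1),
    ({"duration": 0.1, "src_bytes": 7.0, "dst_bytes": 3.0}, 0),
    ({"duration": 0.3, "src_bytes": 2.0, "dst_bytes": 8.0}, 1),
]

report = evaluate_stream(toy_stream, threshold_rule)
print(report)
```

Because the function only needs an iterable of pairs and a prediction callable, the same loop could in principle be pointed at the real dataset and a real model.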
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/datasets/http.py
Signature
class HTTP(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=567_498,
            n_features=3,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/kdd99_http.zip",
            size=32_400_738,
            filename="kdd99_http.csv",
        )

    def _iter(self):
        converters = {
            "duration": float,
            "src_bytes": float,
            "dst_bytes": float,
            "service": int,
        }
        return stream.iter_csv(self.path, target="service", converters=converters)
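The `_iter` method above casts each CSV column with a converter and separates the `service` column as the target. The standard-library sketch below mimics that behaviour on an in-memory sample so the resulting record shape is easy to see. It is not River's implementation of `stream.iter_csv`; the helper `iter_rows` and the two sample rows are hypothetical.

```python
import csv
import io


def iter_rows(buffer, target, converters):
    """Yield (features, label) pairs from CSV text, casting each column."""
    for row in csv.DictReader(buffer):
        # Apply the per-column converter; columns without one stay strings.
        typed = {name: converters.get(name, str)(value)
                 for name, value in row.items()}
        y = typed.pop(target)  # split the target off from the features
        yield typed, y


# Two made-up rows using the column names from the converters above.
sample_csv = io.StringIO(
    "duration,src_bytes,dst_bytes,service\n"
    "0.5,5.2,7.1,0\n"
    "1.5,3.0,9.4,1\n"
)
converters = {
    "duration": float,
    "src_bytes": float,
    "dst_bytes": float,
    "service": int,
}

pairs = list(iter_rows(sample_csv, target="service", converters=converters))
print(pairs[0])
```

Each yielded pair holds a dict with the three float features and an integer label, which matches the output shape described in the I/O contract.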
Import
from river import datasets
dataset = datasets.HTTP()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | — | — | No parameters needed |
Outputs
| Name | Type | Description |
|---|---|---|
| `__iter__()` | tuple(dict, int) | Iterating over the dataset yields (features_dict, target) pairs, where the target marks anomalous connections |
Dataset Properties
| Property | Value |
|---|---|
| Number of samples | 567,498 |
| Number of features | 3 |
| Task | Binary classification (anomaly detection) |
| Format | CSV (compressed) |
| Size | 32,400,738 bytes |
| Class imbalance | Only 0.4% positive labels (2,211 anomalies) |
Features
The CSV file has four columns: three features and the target.
- duration: Duration of the connection (float)
- src_bytes: Number of bytes from source to destination (float)
- dst_bytes: Number of bytes from destination to source (float)
- service: Target variable indicating whether the connection is anomalous (integer); it is not counted among the 3 features
Usage Examples
from river import datasets

dataset = datasets.HTTP()

# Print the first (features, target) pair, then stop.
for x, y in dataset:
    print(x, y)
    break
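Beyond printing a single record, a common first step is to count labels over a prefix of the stream. The sketch below does this for any iterable of (features, label) pairs. The helper `summarize` and the stand-in stream are hypothetical; the dataset object from the example above could presumably be passed in their place, which would require the file to be available locally.

```python
from collections import Counter
from itertools import islice


def summarize(stream, limit=None):
    """Count labels over the first `limit` pairs (or the whole stream)."""
    counts = Counter()
    for _, y in islice(stream, limit):
        counts[y] += 1
    return counts


# Stand-in stream with made-up values, shaped like (features_dict, target).
fake_stream = [
    ({"duration": 0.0, "src_bytes": 1.0, "dst_bytes": 1.0}, y)
    for y in (0, 0, 1, 0, 0)
]

first_three = summarize(fake_stream, limit=3)
everything = summarize(fake_stream)
print(first_three, everything)
```

Using `islice` keeps the loop lazy, so only the requested number of records is read from the underlying stream.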