Implementation: Online ML / River Datasets / Phishing
| Knowledge Sources | River, River Docs |
|---|---|
| Domains | Online_Learning Data_Ingestion Classification |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete tool for loading the Phishing websites dataset as an iterable stream of (x, y) tuples for binary classification benchmarking in online learning.
Description
The Phishing class is a built-in River dataset that provides streaming access to the UCI Phishing Websites dataset. It contains 1,250 samples, each described by 9 features extracted from web pages, with a binary target (is_phishing) indicating whether the page is a phishing site.
The class inherits from base.FileDataset, which handles file location and metadata. When iterated, it delegates to stream.iter_csv to read the compressed CSV file (phishing.csv.gz) row by row. Each row is converted to a feature dictionary with appropriately typed values: most features are cast to float, some to int, and the target is converted to a bool via the converter lambda x: x == "1".
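The per-row typing can be sketched in plain Python. This is an illustrative stand-in for what stream.iter_csv does with its converters argument, not River's actual code; the converter mapping shown covers only a subset of the nine features.

```python
# Illustrative sketch of stream.iter_csv-style type conversion (not River's
# actual implementation). Only a few converters are shown.
converters = {
    "empty_server_form_handler": float,
    "age_of_domain": int,
    "is_phishing": lambda v: v == "1",  # target: "1" -> True, anything else -> False
}

def convert_row(raw: dict) -> tuple[dict, bool]:
    # Apply each column's converter (defaulting to float), then pop the target.
    row = {k: converters.get(k, float)(v) for k, v in raw.items()}
    y = row.pop("is_phishing")
    return row, y

x, y = convert_row({"empty_server_form_handler": "1.0",
                    "age_of_domain": "1",
                    "is_phishing": "1"})
```

Note how the raw CSV strings become a typed feature dictionary plus a boolean target, which is exactly the (x, y) shape the dataset yields on iteration.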
The dataset serves as a commonly used benchmark for demonstrating and comparing binary classification algorithms in River's documentation and test suite.
Usage
Import this class when you need a ready-to-use, self-contained binary classification dataset for:
- Benchmarking online classifiers (logistic regression, Hoeffding trees, Naive Bayes, etc.)
- Demonstrating River pipelines and evaluation protocols in tutorials or examples.
- Running quick experiments with a small but realistic dataset (1,250 samples).
Code Reference
Source Location
| File | Lines |
|---|---|
| river/datasets/phishing.py | L8-L43 |
Signature
class Phishing(base.FileDataset):
def __init__(self) -> None
The class takes no parameters. All metadata (number of samples, features, task type, filename) is hardcoded in the constructor.
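The hardcoded metadata can be illustrated with a simplified stand-in (the base class here is a stub; River's real base.FileDataset also handles file resolution, descriptions, and repr):

```python
# Stub of base.FileDataset for illustration only; River's real class does more.
class FileDataset:
    def __init__(self, filename, n_samples, n_features, task):
        self.filename = filename
        self.n_samples = n_samples
        self.n_features = n_features
        self.task = task

BINARY_CLF = "Binary classification"  # mirrors river.datasets.base.BINARY_CLF

class Phishing(FileDataset):
    """All metadata is hardcoded; the constructor takes no arguments."""

    def __init__(self):
        super().__init__(
            filename="phishing.csv.gz",
            n_samples=1250,
            n_features=9,
            task=BINARY_CLF,
        )
```

This is why the I/O contract below lists no constructor parameters: everything a caller might configure is fixed at definition time.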
Import
from river import datasets
dataset = datasets.Phishing()
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| (none) | N/A | The constructor takes no arguments. Dataset metadata is fixed. |
Outputs
| Output | Type | Description |
|---|---|---|
| Iterator element | (x: dict, y: bool) | Each iteration yields a tuple of a feature dictionary and a boolean target. Features are keyed by name with float or int values. Target is True for phishing, False otherwise. |
| n_samples | int | 1,250 total samples |
| n_features | int | 9 features |
| task | str | Binary classification (base.BINARY_CLF) |
Feature dictionary example:
{
'empty_server_form_handler': 1.0,
'popup_window': 0.0,
'https': 1.0,
'request_from_other_domain': 1.0,
'anchor_from_other_domain': 0.0,
'is_popular': 0.0,
'long_url': 0.0,
'age_of_domain': 1,
'ip_in_url': 0
}
Usage Examples
Basic iteration:
from river import datasets
dataset = datasets.Phishing()
for x, y in dataset:
print(x, y)
break # Just show first sample
Using with a pipeline and progressive validation:
from river import datasets, evaluate, linear_model, metrics, preprocessing
dataset = datasets.Phishing()
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()
evaluate.progressive_val_score(dataset, model, metric)
# Accuracy around 89% (exact value may vary across River versions)
Accessing metadata:
from river import datasets
dataset = datasets.Phishing()
print(dataset.n_samples) # 1250
print(dataset.n_features) # 9
print(dataset.task)     # Binary classification