Implementation: Online ML / River Datasets / Phishing
| Knowledge Sources | River, River Docs |
|---|---|
| Domains | Online_Learning Data_Ingestion Classification |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete tool for loading the Phishing websites dataset as an iterable stream of (x, y) tuples for binary classification benchmarking in online learning.
Description
The Phishing class is a built-in River dataset that provides streaming access to the UCI Phishing Websites dataset. It contains 1,250 samples, each described by 9 features extracted from web pages, with a binary target (is_phishing) indicating whether the page is a phishing site.
The class inherits from base.FileDataset, which handles file location and metadata. When iterated, it delegates to stream.iter_csv to read the compressed CSV file (phishing.csv.gz) row by row. Each row is converted to a feature dictionary with appropriately typed values: most features are cast to float, some to int, and the target is converted to a bool via the converter lambda x: x == "1".
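The per-row typing can be sketched in plain Python. This is an illustrative stand-in for what stream.iter_csv does with its converters argument, not River's actual code; the converter mapping shown covers only a subset of the nine features.

```python
# Illustrative sketch of stream.iter_csv-style type conversion (not River's
# actual implementation). Only a few converters are shown.
converters = {
    "empty_server_form_handler": float,
    "age_of_domain": int,
    "is_phishing": lambda v: v == "1",  # target: "1" -> True, anything else -> False
}

def convert_row(raw: dict) -> tuple[dict, bool]:
    # Apply each column's converter (defaulting to float), then pop the target.
    row = {k: converters.get(k, float)(v) for k, v in raw.items()}
    y = row.pop("is_phishing")
    return row, y

x, y = convert_row({"empty_server_form_handler": "1.0",
                    "age_of_domain": "1",
                    "is_phishing": "1"})
```

Note how the raw CSV strings become a typed feature dictionary plus a boolean target, which is exactly the (x, y) shape the dataset yields on iteration.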
The dataset serves as a commonly used benchmark for demonstrating and comparing binary classification algorithms in River's documentation and test suite.
Usage
Import this class when you need a ready-to-use, self-contained binary classification dataset for:
- Benchmarking online classifiers (logistic regression, Hoeffding trees, Naive Bayes, etc.)
- Demonstrating River pipelines and evaluation protocols in tutorials or examples.
- Running quick experiments with a small but realistic dataset (1,250 samples).
Code Reference
Source Location
| File | Lines |
|---|---|
| river/datasets/phishing.py | L8-L43 |
Signature
class Phishing(base.FileDataset):
def __init__(self) -> None
The class takes no parameters. All metadata (number of samples, features, task type, filename) is hardcoded in the constructor.
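The hardcoded metadata can be illustrated with a simplified stand-in (the base class here is a stub; River's real base.FileDataset also handles file resolution, descriptions, and repr):

```python
# Stub of base.FileDataset for illustration only; River's real class does more.
class FileDataset:
    def __init__(self, filename, n_samples, n_features, task):
        self.filename = filename
        self.n_samples = n_samples
        self.n_features = n_features
        self.task = task

BINARY_CLF = "Binary classification"  # mirrors river.datasets.base.BINARY_CLF

class Phishing(FileDataset):
    """All metadata is hardcoded; the constructor takes no arguments."""

    def __init__(self):
        super().__init__(
            filename="phishing.csv.gz",
            n_samples=1250,
            n_features=9,
            task=BINARY_CLF,
        )
```

This is why the I/O contract below lists no constructor parameters: everything a caller might configure is fixed at definition time.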
Import
from river import datasets
dataset = datasets.Phishing()
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| (none) | N/A | The constructor takes no arguments. Dataset metadata is fixed. |
Outputs
| Output | Type | Description |
|---|---|---|
| Iterator element | (x: dict, y: bool) | Each iteration yields a tuple of a feature dictionary and a boolean target. Features are keyed by name with float or int values. Target is True for phishing, False otherwise. |
| n_samples | int | 1,250 total samples |
| n_features | int | 9 features |
| task | str | Binary classification (base.BINARY_CLF) |
Feature dictionary example:
{
'empty_server_form_handler': 1.0,
'popup_window': 0.0,
'https': 1.0,
'request_from_other_domain': 1.0,
'anchor_from_other_domain': 0.0,
'is_popular': 0.0,
'long_url': 0.0,
'age_of_domain': 1,
'ip_in_url': 0
}
Usage Examples
Basic iteration:
from river import datasets
dataset = datasets.Phishing()
for x, y in dataset:
print(x, y)
break # Just show first sample
Using with a pipeline and progressive validation:
from river import datasets, evaluate, linear_model, metrics, preprocessing
dataset = datasets.Phishing()
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()
evaluate.progressive_val_score(dataset, model, metric)
# Accuracy around 89% (exact value may vary across River versions)
Accessing metadata:
from river import datasets
dataset = datasets.Phishing()
print(dataset.n_samples) # 1250
print(dataset.n_features) # 9
print(dataset.task)     # Binary classification