Implementation:Online ml River Datasets Phishing

From Leeroopedia


Knowledge Sources: River, River Docs
Domains: Online_Learning, Data_Ingestion, Classification
Last Updated: 2026-02-08 16:00 GMT

Overview

Concrete tool for loading the Phishing websites dataset as an iterable stream of (x, y) tuples for binary classification benchmarking in online learning.

Description

The Phishing class is a built-in River dataset that provides streaming access to the Website Phishing dataset from the UCI Machine Learning Repository. It contains 1,250 samples, each described by 9 features extracted from web pages, with a binary target (is_phishing) indicating whether the page is a phishing site.

The class inherits from base.FileDataset, which handles file location and metadata. When iterated, it delegates to stream.iter_csv to read the compressed CSV file (phishing.csv.gz) row by row. Each row becomes a feature dictionary with typed values: most features are cast to float, age_of_domain and ip_in_url are cast to int, and the target is converted to a bool by the converter lambda x: x == "1".
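The row-by-row conversion described above can be imitated with the standard library alone. The sketch below is not River's implementation: iter_rows and CONVERTERS are hypothetical names, and the two CSV rows are made up for illustration rather than taken from the real file.

```python
import csv
import gzip
import io

# Converters mirroring the typing described above: most features become
# float, two become int, and the target becomes a bool.
CONVERTERS = {
    "empty_server_form_handler": float,
    "popup_window": float,
    "https": float,
    "request_from_other_domain": float,
    "anchor_from_other_domain": float,
    "is_popular": float,
    "long_url": float,
    "age_of_domain": int,
    "ip_in_url": int,
    "is_phishing": lambda value: value == "1",
}


def iter_rows(gz_bytes, target="is_phishing", converters=CONVERTERS):
    """Yield (x, y) pairs from gzip-compressed CSV bytes, one row at a time."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            # Apply the per-column converter to each raw string value.
            typed = {name: converters[name](value) for name, value in row.items()}
            # Split the target off from the feature dictionary.
            y = typed.pop(target)
            yield typed, y


# Two made-up rows (NOT from the real dataset) to exercise the function.
header = ",".join(CONVERTERS)
sample_csv = header + "\n" + "1,0,1,1,0,0,0,1,0,1\n" + "0.5,0,0,0,0.5,1,0,1,1,0\n"
sample_gz = gzip.compress(sample_csv.encode("utf-8"))

rows = list(iter_rows(sample_gz))
first_x, first_y = rows[0]
print(first_x["https"], first_x["age_of_domain"], first_y)  # 1.0 1 True
```

Because iter_rows is a generator, only one row is held in memory at a time, which is the property that makes this style of loading suitable for online learning.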

The dataset is commonly used in River's documentation and test suite to demonstrate and compare binary classification algorithms.

Usage

Import this class when you need a ready-to-use, self-contained binary classification dataset for:

  • Benchmarking online classifiers (logistic regression, Hoeffding trees, Naive Bayes, etc.).
  • Demonstrating River pipelines and evaluation protocols in tutorials or examples.
  • Running quick experiments with a small but realistic dataset (1,250 samples).

Code Reference

Source Location

File | Lines
river/datasets/phishing.py | L8-L43

Signature

class Phishing(base.FileDataset):
    def __init__(self) -> None

The class takes no parameters. All metadata (number of samples, features, task type, filename) is hardcoded in the constructor.

Import

from river import datasets

dataset = datasets.Phishing()

I/O Contract

Inputs

Parameter | Type | Description
(none) | N/A | The constructor takes no arguments. Dataset metadata is fixed.

Outputs

Output | Type | Description
Iterator element | (x: dict, y: bool) | Each iteration yields a tuple of a feature dictionary and a boolean target. Features are keyed by name with float or int values. Target is True for phishing, False otherwise.
n_samples | int | 1,250 total samples
n_features | int | 9 features
task | str | Binary classification (base.BINARY_CLF)

Feature dictionary example:

{
    'empty_server_form_handler': 1.0,
    'popup_window': 0.0,
    'https': 1.0,
    'request_from_other_domain': 1.0,
    'anchor_from_other_domain': 0.0,
    'is_popular': 0.0,
    'long_url': 0.0,
    'age_of_domain': 1,
    'ip_in_url': 0
}

Usage Examples

Basic iteration:

from river import datasets

dataset = datasets.Phishing()

for x, y in dataset:
    print(x, y)
    break  # Just show first sample
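To inspect more than one sample without consuming the whole stream, itertools.islice works on any iterable of (x, y) pairs. The sketch below uses a hypothetical stand-in generator (fake_stream) with made-up values so that it runs without River installed; with River available, the dataset object could presumably be passed to islice in the same way.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a dataset: lazily yields (x, y) pairs."""
    for i in range(1_250):
        yield {"long_url": float(i % 2)}, i % 3 == 0


# Take only the first three samples; the rest of the stream is never read.
head = list(islice(fake_stream(), 3))
for x, y in head:
    print(x, y)
```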

Using with a pipeline and progressive validation:

from river import datasets, evaluate, linear_model, metrics, preprocessing

dataset = datasets.Phishing()

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()

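# progressive_val_score returns the metric; print it to see the result in a script.
print(evaluate.progressive_val_score(dataset, model, metric))
# This page reports Accuracy: 88.96%; the exact value may vary with the River version and optimizer settings.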
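Conceptually, progressive validation is a test-then-train loop: each sample is used to score the model before the model is allowed to learn from it. The following is a minimal stdlib-only sketch of that idea, not River's implementation. MajorityClass, progressive_accuracy, and the toy stream are hypothetical; only the predict_one / learn_one method names follow River's convention.

```python
class MajorityClass:
    """Toy online classifier (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {True: 0, False: 0}

    def predict_one(self, x):
        # Ties (including the very first sample) resolve to False.
        return self.counts[True] > self.counts[False]

    def learn_one(self, x, y):
        self.counts[y] += 1


def progressive_accuracy(stream, model):
    """Test-then-train: score each prediction before the model sees the label."""
    correct = 0
    total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)  # 1. predict on the unseen sample
        correct += int(y_pred == y)    # 2. update the running metric
        model.learn_one(x, y)          # 3. only then learn from the sample
        total += 1
    return correct / total if total else 0.0


# Made-up stream of (features, label) pairs, standing in for a real dataset.
toy_stream = [
    ({"https": 1.0}, True),
    ({"https": 0.0}, True),
    ({"https": 1.0}, False),
    ({"https": 1.0}, True),
]

accuracy = progressive_accuracy(toy_stream, MajorityClass())
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 50.00%
```

Since every sample is scored before it is learned, the whole dataset acts as both training and test data, and no separate hold-out split is needed.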

Accessing metadata:

from river import datasets

dataset = datasets.Phishing()
print(dataset.n_samples)   # 1250
print(dataset.n_features)  # 9
print(dataset.task)        # Binary classification (the value of base.BINARY_CLF)
