Implementation:Online ml River Datasets TREC07

From Leeroopedia


Knowledge Sources
Domains Online_Learning, Datasets, Binary_Classification, Text_Classification
Last Updated 2026-02-08 16:00 GMT

Overview

Concrete dataset for binary text classification provided by the River library.

Description

The TREC 2007 Spam Track dataset. It contains 75,419 chronologically ordered emails, covering three months of mail delivered to a particular server in 2007. Spam messages make up 66.6% of the dataset. The goal is to predict whether an email is spam or not.

The available raw features are: sender, recipients, date, subject, body.

As exposed by River, the dataset has 75,419 samples and 5 features, and it is registered as a binary classification task.
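The class balance implied by the figures above can be worked out directly. The sketch below is plain arithmetic on the two numbers quoted in the description; because 66.6% is a rounded figure, the resulting counts are approximations, not exact label counts.

```python
N_SAMPLES = 75_419   # from the dataset description
SPAM_SHARE = 0.666   # rounded share quoted in the description

# Approximate class counts implied by the rounded percentage.
approx_spam = round(N_SAMPLES * SPAM_SHARE)
approx_ham = N_SAMPLES - approx_spam

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline_accuracy = max(SPAM_SHARE, 1 - SPAM_SHARE)

print(approx_spam, approx_ham)               # 50229 25190
print(f"{majority_baseline_accuracy:.1%}")   # 66.6%
```

Any classifier evaluated on this dataset should be compared against that roughly 66.6% majority-class baseline, not against 50%.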

Usage

This dataset is useful for:

  • Email spam detection and filtering
  • Text classification on structured email data
  • Time-ordered classification (chronologically sorted)
  • Handling imbalanced text classification
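Because the items are sorted chronologically, a holdout evaluation would normally train on the earlier emails and test on the later ones instead of shuffling. The following is a minimal sketch of that idea on a made-up in-memory stream; `chronological_split` and `toy_stream` are hypothetical names used for illustration and are not part of River.

```python
from itertools import islice


def chronological_split(stream, n_train):
    """Return (first n_train items as a list, iterator over the rest), preserving order."""
    it = iter(stream)
    head = list(islice(it, n_train))
    return head, it


# Hypothetical stand-in for the real dataset: ten ordered (features, label) pairs.
toy_stream = [({"subject": f"msg {i}"}, i % 3 != 0) for i in range(10)]

train, rest = chronological_split(toy_stream, n_train=8)
test = list(rest)

print(len(train), len(test))    # 8 2
print(test[0][0]["subject"])    # msg 8
```

The same function accepts any iterable of `(x, y)` pairs, so it could in principle be applied to the dataset object itself, at the cost of holding the training portion in memory.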

Code Reference

Source Location

Signature

from river import stream
from river.datasets import base


class TREC07(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=75_419,
            n_features=5,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/trec07p.zip",
            size=144_504_829,
            filename="trec07p.csv",
        )

    def _iter(self):
        return stream.iter_csv(
            self.path,
            target="y",
            delimiter=",",
            quotechar='"',
            field_size_limit=1_000_000,
        )
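To make the role of the `target`, `delimiter`, `quotechar`, and `field_size_limit` arguments concrete, here is a rough standard-library approximation of what `_iter` yields. It is a sketch and not River's implementation: `iter_rows` is a hypothetical helper, and the two CSV rows, including their label values, are invented for illustration.

```python
import csv
import io

# Invented sample with the same column names as the dataset (values are made up).
SAMPLE_CSV = (
    "sender,recipients,date,subject,body,y\n"
    '"a@example.com","b@example.com","Sun, 08 Apr 2007 21:00:48 +0000","Hi","Hello, world",0\n'
    '"c@example.com","d@example.com","Sun, 08 Apr 2007 21:05:10 +0000","Offer","Buy now",1\n'
)


def iter_rows(buffer, target="y", delimiter=",", quotechar='"', field_size_limit=1_000_000):
    """Yield (features, target) pairs from a CSV buffer; all values stay strings."""
    # Raise the per-field limit so long email bodies do not trigger a csv.Error.
    # Note that this setting is global to the csv module.
    csv.field_size_limit(field_size_limit)
    reader = csv.DictReader(buffer, delimiter=delimiter, quotechar=quotechar)
    for row in reader:
        y = row.pop(target)  # split the label column off from the features
        yield row, y


rows = list(iter_rows(io.StringIO(SAMPLE_CSV)))
for x, y in rows:
    print(sorted(x), y)
```

The quote character matters here because email bodies routinely contain commas; the first sample row has a body of `Hello, world` that must stay a single field.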

Import

from river import datasets
dataset = datasets.TREC07()

I/O Contract

Inputs

Name | Type | Required | Description
(none) | - | - | No parameters needed

Outputs

Name | Type | Description
iter(dataset) | tuple(dict, target) | Yields (features_dict, target) pairs, where features_dict holds the raw email fields and target is the spam label

Dataset Properties

Property | Value
Number of samples | 75,419
Number of features | 5
Task | Binary classification
Format | CSV file (trec07p.csv) distributed as a ZIP archive
Size | 144,504,829 bytes (~138 MiB)
Spam percentage | 66.6%
Time period | 3 months (2007)
Ordering | Chronological

Features

The dataset includes the following email features:

  • sender: Email sender address
  • recipients: Email recipient addresses
  • date: Email timestamp
  • subject: Email subject line
  • body: Email body content
  • y: Target variable indicating spam or ham
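The raw fields are free text, so a model usually needs derived features. The sketch below shows one possible way to turn `sender`, `date`, and `subject` into simple features using only the standard library. `derive_features` and the sample record are hypothetical, and the code assumes the `date` field is an RFC 2822 style string, which is not confirmed by this page.

```python
from email.utils import parseaddr, parsedate_to_datetime


def derive_features(x):
    """Build a few simple features from one raw email record (illustrative only)."""
    _, address = parseaddr(x["sender"])              # strip any display name
    domain = address.rpartition("@")[2].lower()      # text after the last "@"
    sent_at = parsedate_to_datetime(x["date"])       # assumes an RFC 2822 date string
    return {
        "sender_domain": domain,
        "hour": sent_at.hour,
        "weekday": sent_at.weekday(),                # Monday is 0
        "subject_length": len(x["subject"]),
    }


# Invented record using the dataset's field names.
sample = {
    "sender": "Alice Example <alice@Example.com>",
    "recipients": "bob@example.org",
    "date": "Mon, 09 Apr 2007 10:30:00 -0400",
    "subject": "Quarterly report",
    "body": "Please find the report attached.",
}

features = derive_features(sample)
print(features)
```

Real mail headers are often malformed, so production code would need to handle parse failures; this sketch lets such errors propagate.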

Usage Examples

from river import datasets

dataset = datasets.TREC07()
for x, y in dataset:
    print(f"Features: {list(x.keys())}")
    print(f"Is spam: {y}")
    break

Example with Text Processing

from river import datasets, feature_extraction, naive_bayes, compose

dataset = datasets.TREC07()

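# Build bag-of-words counts from the email body, then classify with Bernoulli naive Bayes.
model = compose.Pipeline(
    feature_extraction.BagOfWords(on='body'),
    naive_bayes.BernoulliNB()
)

for x, y in dataset:
    y_pred = model.predict_one(x)  # predict first, so the model is tested on unseen data
    model.learn_one(x, y)          # then update the model with the true label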
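The predict-then-learn order in the loop above is what makes it possible to score a model on a stream: each prediction is made before the model has seen that item's label. The following self-contained sketch illustrates the bookkeeping on a tiny invented stream. `TinyNB` is a hypothetical, simplified naive Bayes that scores only the tokens present in a message; it is not River's `BernoulliNB`, and the four messages are made up.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class TinyNB:
    """Simplified online naive Bayes over token presence (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)  # label -> token -> message count

    def learn_one(self, tokens, y):
        self.class_counts[y] += 1
        self.token_counts[y].update(tokens)

    def predict_one(self, tokens):
        if not self.class_counts:
            return None  # nothing learned yet
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for tok in tokens:
                seen = self.token_counts[label][tok]
                score += math.log((seen + self.alpha) / (n + 2 * self.alpha))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented, chronologically ordered (features, label) pairs.
toy_stream = [
    ({"body": "Cheap pills buy now"}, "spam"),
    ({"body": "Meeting agenda for Monday"}, "ham"),
    ({"body": "Buy cheap watches now"}, "spam"),
    ({"body": "Agenda and notes from the meeting"}, "ham"),
]

model = TinyNB()
correct = total = 0
for x, y in toy_stream:
    tokens = tokenize(x["body"])
    y_pred = model.predict_one(tokens)   # test on the item first
    if y_pred is not None:
        correct += y_pred == y
        total += 1
    model.learn_one(tokens, y)           # then train on it

print(f"progressive accuracy: {correct}/{total}")
```

On this toy stream the first item is skipped because the model has no classes yet, and the second is misclassified because only spam has been seen so far, which shows why early progressive scores tend to be pessimistic.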
