Implementation:Online ml River Datasets TREC07
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Datasets, Binary_Classification, Text_Classification |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete dataset for binary text classification provided by the River library.
Description
TREC's 2007 Spam Track dataset. It contains 75,419 chronologically ordered emails delivered to a particular server over 3 months in 2007. Spam messages make up 66.6% of the dataset. The goal is to predict whether an email is spam or not.
The available raw features are: sender, recipients, date, subject, body.
This dataset contains 75,419 samples with 5 features for binary classification tasks.
Usage
This dataset is useful for:
- Email spam detection and filtering
- Text classification on structured email data
- Time-ordered classification (chronologically sorted)
- Text classification with imbalanced classes (about two thirds of the emails are spam)
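Because the emails are chronologically ordered, a natural way to evaluate a model on them is test-then-train (prequential) evaluation: each item is predicted before its label is used for learning. The sketch below uses only the standard library and a synthetic label stream (an assumption, not the real data) with roughly two thirds spam. It illustrates that a majority-class baseline already scores close to the spam rate, which is the floor a real model should beat. `MajorityClass` and `prequential_accuracy` are hypothetical names, not River APIs.

```python
import random
from collections import Counter


class MajorityClass:
    """Hypothetical baseline: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        # No prediction is possible before any label has been observed.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Test-then-train: score each prediction before the label is revealed."""
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)
        correct += int(y_pred == y)
        total += 1
        model.learn_one(x, y)
    return correct / total


# Synthetic stand-in for the label stream (assumption): about 66.6% spam.
rng = random.Random(42)
labels = ["spam" if rng.random() < 0.666 else "ham" for _ in range(10_000)]
stream = [({}, y) for y in labels]

accuracy = prequential_accuracy(stream, MajorityClass())
print(f"Majority-class prequential accuracy: {accuracy:.3f}")
```

Accuracy alone can therefore look deceptively good on this dataset; comparing against such a baseline, or using a metric that accounts for class balance, gives a fairer picture.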
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/datasets/trec07.py
Signature
class TREC07(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=75_419,
            n_features=5,
            task=base.BINARY_CLF,
            url="https://maxhalford.github.io/files/datasets/trec07p.zip",
            size=144_504_829,
            filename="trec07p.csv",
        )

    def _iter(self):
        return stream.iter_csv(
            self.path,
            target="y",
            delimiter=",",
            quotechar='"',
            field_size_limit=1_000_000,
        )
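The `field_size_limit=1_000_000` argument matters because Python's `csv` module rejects fields longer than 131,072 characters by default, and email bodies can exceed that. The following is a rough standard-library sketch of a CSV streaming iterator of this shape, run on a made-up in-memory record. It is not River's `stream.iter_csv` implementation, and `iter_rows` is a hypothetical helper.

```python
import csv
import io


def iter_rows(buffer, target, delimiter=",", quotechar='"', field_size_limit=None):
    """Yield (features, label) pairs from a CSV file object (hypothetical helper)."""
    previous_limit = csv.field_size_limit()
    if field_size_limit is not None:
        # Raise the per-field character limit for the duration of the iteration.
        csv.field_size_limit(field_size_limit)
    try:
        reader = csv.DictReader(buffer, delimiter=delimiter, quotechar=quotechar)
        for row in reader:
            # Split the target column off from the remaining feature columns.
            y = row.pop(target)
            yield row, y
    finally:
        # Restore the module-wide limit once iteration ends.
        csv.field_size_limit(previous_limit)


# Made-up record whose body is longer than the default field limit.
long_body = "x" * 200_000
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sender", "recipients", "date", "subject", "body", "y"])
writer.writerow(["a@example.com", "b@example.com", "2007-04-08", "Hello", long_body, "ham"])
buffer.seek(0)

rows = list(iter_rows(buffer, target="y", field_size_limit=1_000_000))
x, y = rows[0]
print(sorted(x), y, len(x["body"]))
```

Note that every value, including the label, comes back as a string unless it is converted explicitly.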
Import
from river import datasets
dataset = datasets.TREC07()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | — | — | No parameters needed |
Outputs
| Name | Type | Description |
|---|---|---|
| Iteration (`for x, y in dataset`) | tuple(dict, target) | Yields (features_dict, target) pairs: the five raw email fields and the spam label |
Dataset Properties
| Property | Value |
|---|---|
| Number of samples | 75,419 |
| Number of features | 5 |
| Task | Binary classification |
| Format | CSV (compressed) |
| Size | 144,504,829 bytes (~138 MiB) |
| Spam percentage | 66.6% |
| Time period | 3 months (2007) |
| Ordering | Chronological |
Features
The dataset includes the following email features:
- sender: Email sender address
- recipients: Email recipient addresses
- date: Email timestamp
- subject: Email subject line
- body: Email body content
- y: Target variable indicating spam or ham
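The raw fields are plain strings, so most models need them converted to numeric features first. Below is a minimal sketch on a made-up record (not taken from the dataset) that counts lowercase word tokens in `subject` and `body` and prefixes each token with its field name. `bag_of_words` is a hypothetical helper written for illustration, not a River API.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")


def bag_of_words(x, fields=("subject", "body")):
    """Count lowercase word tokens per field, keyed as 'field:token'."""
    counts = Counter()
    for field in fields:
        text = x.get(field) or ""  # tolerate missing or empty fields
        for token in TOKEN.findall(text.lower()):
            counts[f"{field}:{token}"] += 1
    return dict(counts)


# Made-up email record using the dataset's field names.
email = {
    "sender": "promo@example.com",
    "recipients": "user@example.org",
    "date": "2007-04-08",
    "subject": "Free offer",
    "body": "Claim your free offer now. Free shipping!",
}
features = bag_of_words(email)
print(features)
```

Prefixing tokens by field keeps a word in the subject distinct from the same word in the body, which can matter for spam detection.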
Usage Examples
from river import datasets

dataset = datasets.TREC07()

# Inspect only the first (features, label) pair
for x, y in dataset:
    print(f"Features: {list(x.keys())}")
    print(f"Is spam: {y}")
    break
Example with Text Processing
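from river import compose, datasets, feature_extraction, metrics, naive_bayes

dataset = datasets.TREC07()
# Bag-of-words counts over the email body, fed to a Bernoulli naive Bayes classifier
model = compose.Pipeline(
    feature_extraction.BagOfWords(on="body"),
    naive_bayes.BernoulliNB(),
)
metric = metrics.Accuracy()

# Test-then-train: predict each email before learning from it
for x, y in dataset:
    y_pred = model.predict_one(x)
    if y_pred is not None:  # no prediction before the first label is seen
        metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)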