Implementation:Online ml River Datasets SMSSpam

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Datasets, Binary_Classification, Text_Classification
Last Updated	2026-02-08 16:00 GMT

Overview

Concrete dataset for binary text classification provided by the River library.

Description

The SMS Spam Collection dataset contains 5,574 SMS messages, each with a single text feature (the message body). Spam messages make up 13.4% of the dataset. The goal is to predict whether a given SMS is spam.
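The spam proportion quoted above can be checked by streaming over the (features, label) pairs once. The following is a minimal sketch, not part of River: the helper name spam_fraction and the toy samples are made up for illustration. With River installed and the data downloaded, datasets.SMSSpam() could be passed in place of the toy list.

```python
from typing import Dict, Iterable, Tuple

# One sample as yielded by the dataset: ({"body": text}, is_spam)
Sample = Tuple[Dict[str, str], bool]


def spam_fraction(stream: Iterable[Sample]) -> float:
    """Return the fraction of samples whose label is True (spam)."""
    n_total = 0
    n_spam = 0
    for _x, is_spam in stream:
        n_total += 1
        n_spam += int(is_spam)
    # Guard against an empty stream to avoid dividing by zero
    return n_spam / n_total if n_total else 0.0


# Made-up stand-in samples (hypothetical messages, not taken from the dataset)
toy_stream = [
    ({"body": "Free entry! Text WIN to claim\n"}, True),
    ({"body": "See you at lunch?\n"}, False),
    ({"body": "Running late, sorry\n"}, False),
    ({"body": "Call me when you land\n"}, False),
]
print(spam_fraction(toy_stream))  # prints 0.25
```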

Usage

This dataset is useful for:

Text classification and spam detection
Natural language processing tasks
Imbalanced classification problems
SMS/message filtering applications
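Because only 13.4% of the messages are spam, accuracy alone can be misleading. The sketch below works through the arithmetic for a trivial classifier that labels every message as ham. The function name always_ham_scores is hypothetical, and the spam count is derived from the rounded 13.4% figure, so it is an approximation rather than a value read from the data.

```python
def always_ham_scores(n_samples: int, n_spam: int) -> dict:
    """Scores of a classifier that labels every message as ham (not spam)."""
    n_ham = n_samples - n_spam
    true_pos = 0        # it never predicts spam
    false_neg = n_spam  # so every spam message is missed
    accuracy = n_ham / n_samples
    recall = true_pos / (true_pos + false_neg) if n_spam else 0.0
    return {"accuracy": accuracy, "spam_recall": recall}


n_samples = 5_574
# Approximate spam count implied by the 13.4% figure (an assumption)
n_spam = round(0.134 * n_samples)
scores = always_ham_scores(n_samples, n_spam)
print(n_spam, round(scores["accuracy"], 3), scores["spam_recall"])
# prints: 747 0.866 0.0
```

A model therefore needs to beat roughly 86.6% accuracy to be better than doing nothing, which is why class-sensitive metrics such as precision, recall, or F1 on the spam class are usually more informative for this dataset.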

Code Reference

Source Location

Repository: Online_ml_River
File: river/datasets/sms_spam.py

Signature

class SMSSpam(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=5_574,
            n_features=1,
            task=base.BINARY_CLF,
            url="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
            size=477_907,
            filename="SMSSpamCollection",
        )

    def _iter(self):
        with open(self.path) as f:
            for row in f:
                label, body = row.split("\t")
                yield ({"body": body}, label == "spam")
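
The row parsing in _iter can be tried out in isolation. This is a minimal standalone sketch, not River's code: parse_row is a hypothetical helper, the sample rows are made up, and maxsplit=1 is a defensive choice added here so that a tab inside the message body would not break the unpacking. Note that, as in the signature above, the body keeps its trailing newline because lines read from a file retain it.

```python
from typing import Dict, Tuple


def parse_row(row: str) -> Tuple[Dict[str, str], bool]:
    """Split one 'label<TAB>body' line into (features, is_spam)."""
    # maxsplit=1 is an assumption of this sketch, not taken from the source
    label, body = row.split("\t", 1)
    return {"body": body}, label == "spam"


# Made-up rows in the same tab-separated layout as the data file
rows = ["ham\tOk, see you at 5\n", "spam\tWIN a prize now\n"]
for row in rows:
    x, y = parse_row(row)
    print(repr(x["body"]), y)
# prints:
# 'Ok, see you at 5\n' False
# 'WIN a prize now\n' True
```

Callers who do not want the newline can strip it themselves, for example with x["body"].rstrip("\n").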

Import

from river import datasets
dataset = datasets.SMSSpam()

I/O Contract

Inputs

Name	Type	Required	Description
(none)	—	—	No parameters needed

Outputs

Name	Type	Description
__iter__()	tuple(dict, bool)	Iterating over the dataset yields ({"body": text}, is_spam) pairs

Dataset Properties

Property	Value
Number of samples	5,574
Number of features	1 (text)
Task	Binary classification
Format	Tab-separated text
Size	477,907 bytes
Spam percentage	13.4%

Features

body: The SMS message text content (string)

Target

Boolean value indicating whether the SMS is spam (True) or ham/legitimate (False)

Usage Examples

from river import datasets

dataset = datasets.SMSSpam()
for x, y in dataset:
    print(f"Message: {x['body'][:50]}...")
    print(f"Is spam: {y}")
    break

Example with Text Processing

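from river import datasets, feature_extraction, metrics, naive_bayes

dataset = datasets.SMSSpam()
# on="body" tells the vectorizer which field of the feature dict holds the text
model = feature_extraction.BagOfWords(on="body") | naive_bayes.BernoulliNB()
metric = metrics.Accuracy()

# Progressive validation: predict on each message before learning from it
for x, y in dataset:
    y_pred = model.predict_one(x)
    if y_pred is not None:  # no prediction is available before the first update
        metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)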
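
To make the bag-of-words step more concrete, the sketch below shows the general idea using only the standard library. It is a simplified illustration and not River's implementation: the function name bag_of_words, the lowercase conversion, and the \w+ tokenization pattern are all assumptions chosen for this example.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a message to token counts (simplified, illustrative tokenizer)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# Made-up message, not taken from the dataset
counts = bag_of_words("Free entry! Free prize")
print(dict(counts))  # prints {'free': 2, 'entry': 1, 'prize': 1}
```

The resulting token counts play the role of the feature dictionary that a downstream classifier such as naive Bayes learns from, one message at a time.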

References

Almeida, T.A., Hidalgo, J.M.G. and Yamakami, A., 2011, September. Contributions to the study of SMS spam filtering: new collection and results. In Proceedings of the 11th ACM symposium on Document engineering (pp. 259-262).

Related Pages

Environment:Online_ml_River_Python_Runtime_Environment
