Implementation:Online ml River Datasets SMSSpam


Knowledge Sources
Domains: Online_Learning, Datasets, Binary_Classification, Text_Classification
Last Updated: 2026-02-08 16:00 GMT

Overview

SMSSpam is a concrete dataset class, provided by the River library, for binary text classification.

Description

The SMS Spam Collection dataset contains 5,574 SMS messages, each with a single text feature (the SMS body). Spam messages make up 13.4% of the dataset. The goal is to predict whether an SMS is spam or not.

Usage

This dataset is useful for:

  • Text classification and spam detection
  • Natural language processing tasks
  • Imbalanced classification problems
  • SMS/message filtering applications
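
To make the class imbalance concrete, the short sketch below works only from the figures quoted in the description (5,574 messages, 13.4% spam). The spam count is derived from the rounded percentage, so treat it as approximate. It shows why plain accuracy can mislead here: a hypothetical baseline that always answers "ham" scores well on accuracy while catching no spam at all.

```python
# Figures taken from the dataset description: 5,574 messages, 13.4% spam.
N_SAMPLES = 5_574
SPAM_SHARE = 0.134

# Approximate count, derived from the rounded percentage
n_spam = round(N_SAMPLES * SPAM_SHARE)
n_ham = N_SAMPLES - n_spam

# A baseline that always answers "ham" never flags a spam message.
true_pos, false_pos = 0, 0
false_neg, true_neg = n_spam, n_ham

accuracy = (true_pos + true_neg) / N_SAMPLES
recall = true_pos / (true_pos + false_neg)

print(f"approx. spam messages: {n_spam}")
print(f"always-ham accuracy:   {accuracy:.1%}")
print(f"always-ham spam recall: {recall:.1%}")
```

For this reason, metrics such as precision, recall, or F1 on the spam class are usually more informative than accuracy alone on this dataset.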

Code Reference

Signature

class SMSSpam(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=5_574,
            n_features=1,
            task=base.BINARY_CLF,
            url="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
            size=477_907,
            filename="SMSSpamCollection",
        )

    def _iter(self):
        with open(self.path) as f:
            for row in f:
                label, body = row.split("\t")
                yield ({"body": body}, label == "spam")
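
The parsing step in `_iter` can be tried without downloading anything. The sketch below is a standalone variant, not River's code: `iter_sms` is a hypothetical helper, and the two sample rows are made up in the same tab-separated layout. It differs from the listing above in two ways, both assumptions about what a caller may want: it strips the trailing newline (iterating over a file yields lines that still end in "\n", so the listing above leaves it in the body), and it splits on the first tab only.

```python
import io


def iter_sms(lines):
    """Yield ({"body": text}, is_spam) pairs from tab-separated lines."""
    for row in lines:
        # Drop the line ending, then split on the first tab only so that
        # a tab inside the message body would be preserved.
        label, body = row.rstrip("\n").split("\t", 1)
        yield {"body": body}, label == "spam"


# Two made-up rows in the same layout; not taken from the dataset
sample = io.StringIO("ham\tSee you at lunch?\nspam\tWIN a FREE prize now\n")
pairs = list(iter_sms(sample))

for x, y in pairs:
    print(y, repr(x["body"]))
```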

Import

from river import datasets
dataset = datasets.SMSSpam()

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| (none) | n/a | n/a | No parameters needed |

Outputs

| Name | Type | Description |
|------|------|-------------|
| iter() | tuple(dict, bool) | Yields ({"body": text}, is_spam) pairs |
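
Any code that accepts an iterable of such pairs can consume the dataset. As a rough illustration of the contract, the sketch below defines a hypothetical helper, `spam_share`, and feeds it a few made-up stand-in samples so that it runs without River installed; passing `datasets.SMSSpam()` in place of the toy list should work the same way, assuming the pairs have the shape described above.

```python
from typing import Dict, Iterable, Tuple

# Shape of one sample: ({"body": text}, is_spam)
Sample = Tuple[Dict[str, str], bool]


def spam_share(stream: Iterable[Sample]) -> float:
    """Return the fraction of samples whose target is True."""
    n_total = 0
    n_spam = 0
    for x, is_spam in stream:
        assert isinstance(x["body"], str)  # the single text feature
        n_total += 1
        n_spam += is_spam  # a bool counts as 0 or 1
    return n_spam / n_total if n_total else 0.0


# Made-up stand-in samples, not taken from the dataset
toy_stream = [
    ({"body": "Are we still on for tonight?"}, False),
    ({"body": "URGENT! Claim your reward"}, True),
    ({"body": "Call me when you land"}, False),
    ({"body": "Running late, sorry"}, False),
]

share = spam_share(toy_stream)
print(f"spam share: {share:.2f}")
```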

Dataset Properties

| Property | Value |
|----------|-------|
| Number of samples | 5,574 |
| Number of features | 1 (text) |
| Task | Binary classification |
| Format | Tab-separated text |
| Size | 477,907 bytes |
| Spam percentage | 13.4% |

Features

  • body: The SMS message text content (string)

Target

  • Boolean value indicating whether the SMS is spam (True) or ham/legitimate (False)

Usage Examples

from river import datasets

dataset = datasets.SMSSpam()
for x, y in dataset:
    print(f"Message: {x['body'][:50]}...")
    print(f"Is spam: {y}")
    break

Example with Text Processing

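from river import datasets, feature_extraction, metrics, naive_bayes

dataset = datasets.SMSSpam()
# Each sample is a dict, so on="body" tells the vectorizer which field holds the text
model = feature_extraction.BagOfWords(on="body") | naive_bayes.BernoulliNB()
metric = metrics.Accuracy()

for x, y in dataset:
    y_pred = model.predict_one(x)  # predict first, then learn (test-then-train)
    if y_pred is not None:         # no prediction is available before the first update
        metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)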
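
To give a feel for what the vectorizer stage of the pipeline hands to the classifier, the sketch below builds token counts with the standard library only. It is a simplified approximation, not River's implementation: `bag_of_words` is a hypothetical helper, the `\w+` tokenizer is an assumption rather than River's exact pattern, and the message is made up.

```python
import re
from collections import Counter

# Simplified tokenizer; an assumption, not River's exact pattern
TOKEN = re.compile(r"\w+")


def bag_of_words(sample, on="body"):
    """Map a sample dict to lowercase token counts for the field named by `on`."""
    return Counter(TOKEN.findall(sample[on].lower()))


x = {"body": "Free entry! Text FREE to win"}  # made-up message
counts = bag_of_words(x)
print(dict(counts))
```

A Bernoulli naive Bayes model only uses whether each token is present, so the counts above effectively reduce to a set of observed words for that classifier.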

References

  • Almeida, T.A., Hidalgo, J.M.G. and Yamakami, A., 2011, September. Contributions to the study of SMS spam filtering: new collection and results. In Proceedings of the 11th ACM Symposium on Document Engineering (pp. 259-262).
