Implementation:Online ml River Datasets SMSSpam
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Datasets, Binary_Classification, Text_Classification |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete dataset for binary text classification provided by the River library.
Description
SMS Spam Collection dataset. It contains 5,574 messages with a single text feature (the SMS body). Spam messages make up 13.4% of the dataset. The goal is to predict whether an SMS is spam or not.
Usage
This dataset is useful for:
- Text classification and spam detection
- Natural language processing tasks
- Imbalanced classification problems
- SMS/message filtering applications
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/datasets/sms_spam.py
Signature
class SMSSpam(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=5_574,
            n_features=1,
            task=base.BINARY_CLF,
            url="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
            size=477_907,
            filename="SMSSpamCollection",
        )

    def _iter(self):
        with open(self.path) as f:
            for row in f:
                label, body = row.split("\t")
                yield ({"body": body}, label == "spam")
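The _iter method above splits each line on a tab into a label and a message body. A minimal standalone sketch of that parsing step, run on two made-up lines rather than the real file, may help show what the yielded pairs look like. The helper name parse_rows and the sample messages are assumptions for illustration and are not part of River. As in the signature, the body keeps its trailing newline.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_rows(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], bool]]:
    """Mimic the label/body split shown in the signature (hypothetical helper)."""
    for row in lines:
        # Each line has the form "<label>\t<message text>\n".
        label, body = row.split("\t")
        yield ({"body": body}, label == "spam")


# Made-up sample lines, not taken from the dataset.
sample = [
    "ham\tSee you at lunch?\n",
    "spam\tYou won a prize, reply now\n",
]
pairs = list(parse_rows(sample))
```

If the trailing newline matters for downstream tokenization, calling body.rstrip("\n") before use is one option; whether that is needed depends on the text pipeline.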
Import
from river import datasets
dataset = datasets.SMSSpam()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | — | — | No parameters needed |
Outputs
| Name | Type | Description |
|---|---|---|
| __iter__() | tuple(dict, bool) | Yields ({"body": text}, is_spam) pairs |
Dataset Properties
| Property | Value |
|---|---|
| Number of samples | 5,574 |
| Number of features | 1 (text) |
| Task | Binary classification |
| Format | Tab-separated text |
| Size | 477,907 bytes |
| Spam percentage | 13.4% |
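The spam percentage in the table can be checked by counting labels while streaming over the data. The sketch below shows one way to do that for any stream of (x, y) pairs. The function name class_balance and the toy stream are assumptions for illustration; the toy data stands in for the real dataset so the example runs without a download.

```python
from collections import Counter


def class_balance(stream):
    """Return the fraction of each label in a stream of (x, y) pairs."""
    counts = Counter(y for _, y in stream)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy stream standing in for the real dataset (1 spam out of 4 messages).
toy = [
    ({"body": "hi"}, False),
    ({"body": "free prize"}, True),
    ({"body": "ok"}, False),
    ({"body": "call me"}, False),
]
balance = class_balance(toy)
```

Passing datasets.SMSSpam() in place of the toy list should, going by the figures in the table, give a value close to 0.134 for the True label.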
Features
- body: The SMS message text content (string)
Target
- Boolean value indicating whether the SMS is spam (True) or ham/legitimate (False)
Usage Examples
from river import datasets

dataset = datasets.SMSSpam()

# Print the first message and its label, then stop
for x, y in dataset:
    print(f"Message: {x['body'][:50]}...")
    print(f"Is spam: {y}")
    break
Example with Text Processing
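from river import datasets, feature_extraction, naive_bayes

dataset = datasets.SMSSpam()

# on="body" tells BagOfWords which dict field holds the text;
# without it, the transformer expects each x to be a raw string
model = feature_extraction.BagOfWords(on="body") | naive_bayes.BernoulliNB()

for x, y in dataset:
    y_pred = model.predict_one(x)  # predict before learning (test-then-train)
    model.learn_one(x, y)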
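The loop above predicts on each message before learning from it, but it does not record how well the model does. The following pure-Python sketch shows the same test-then-train pattern with a running accuracy. It uses a hypothetical MajorityBaseline class instead of a River model so that it runs without any download; both names, progressive_accuracy and MajorityBaseline, are assumptions for illustration.

```python
from collections import Counter


class MajorityBaseline:
    """Hypothetical baseline: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def progressive_accuracy(stream, model):
    """Predict first, then learn; return the fraction of correct predictions."""
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)
        correct += int(y_pred == y)
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


# Toy stream standing in for the real dataset.
toy = [
    ({"body": "a"}, False),
    ({"body": "b"}, False),
    ({"body": "win"}, True),
    ({"body": "c"}, False),
]
accuracy = progressive_accuracy(toy, MajorityBaseline())
```

Because only about 13.4% of the messages are spam, a baseline of this kind would be expected to reach roughly 86.6% accuracy on the full dataset while never flagging a single spam message. A real model is therefore better judged against that figure, or with a metric that accounts for class imbalance, than against 50%.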
References
- Almeida, T.A., Hidalgo, J.M.G. and Yamakami, A., 2011, September. Contributions to the study of SMS spam filtering: new collection and results. In Proceedings of the 11th ACM symposium on Document engineering (pp. 259-262).