Implementation:Online ml River FeatureExtraction Vectorize
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Feature_Engineering, Text_Processing |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Text vectorization methods for converting documents to numeric representations using bag-of-words and TF-IDF.
Description
This module provides streaming text vectorization through two classes. BagOfWords counts token occurrences in each document after preprocessing (accent removal, lowercasing, tokenization, stop word removal, n-gram extraction). TFIDF extends BagOfWords by weighting each term by its inverse document frequency, which is updated incrementally as documents arrive. Both support custom preprocessing callables, regex-based tokenization, n-gram extraction, and mini-batch processing with sparse DataFrames, so a single vectorizer covers the full path from raw text to numeric features.
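The preprocessing chain described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not River's implementation: the helper names strip_accents and bag_of_words are hypothetical, the order of the steps follows the description, and only the token pattern is taken from the signature on this page. Because that pattern requires at least two word characters, single-character tokens such as "I" are dropped.

```python
import collections
import re
import unicodedata

# Default token pattern copied from the BagOfWords signature on this page.
TOKEN_PATTERN = re.compile(r"(?u)\b\w[\w\-]+\b")


def strip_accents(text):
    # Hypothetical helper: decompose characters, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def bag_of_words(text, stop_words=frozenset(), ngram_range=(1, 1)):
    # Hypothetical helper: accent removal -> lowercasing -> tokenization
    # -> stop word removal -> n-gram counting.
    text = strip_accents(text).lower()
    tokens = [t for t in TOKEN_PATTERN.findall(text) if t not in stop_words]
    low, high = ngram_range
    counts = collections.Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            # Unigrams are keyed by string, longer n-grams by tuple.
            gram = tokens[i] if n == 1 else tuple(tokens[i:i + n])
            counts[gram] += 1
    return dict(counts)


print(bag_of_words("This document is the second document."))
print(bag_of_words("I love the smell", ngram_range=(1, 2)))
```

Keying unigrams by string and longer n-grams by tuple mirrors the shape of the n-gram output shown in the usage examples.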
Usage
Use BagOfWords for simple token counting when term frequency alone is sufficient. Use TFIDF when rare terms should be weighted more heavily than common ones, which is typical in text classification and information retrieval. Both turn raw text into numeric features that downstream machine learning models can consume. Configure ngram_range to capture phrase-level information. Set the on parameter to the name of the dictionary field that contains the text, or leave it as None and pass strings directly.
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/feature_extraction/vectorize.py
Signature
class BagOfWords(base.Transformer, VectorizerMixin):
    def __init__(
        self,
        on: str | None = None,
        strip_accents=True,
        lowercase=True,
        preprocessor: typing.Callable | None = None,
        stop_words: set[str] | None = None,
        tokenizer_pattern=r"(?u)\b\w[\w\-]+\b",
        tokenizer: typing.Callable | None = None,
        ngram_range=(1, 1),
    )
class TFIDF(BagOfWords):
    def __init__(
        self,
        normalize=True,
        on: str | None = None,
        strip_accents=True,
        lowercase=True,
        preprocessor: typing.Callable | None = None,
        stop_words: set[str] | None = None,
        tokenizer_pattern=r"(?u)\b\w[\w\-]+\b",
        tokenizer: typing.Callable | None = None,
        ngram_range=(1, 1),
    )
Import
from river import feature_extraction
I/O Contract
| Input | Output |
|---|---|
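| str, or Dict[str, str] with the text stored under the on field - Text document | Dict mapping each token (str) or n-gram (tuple of str) to a number - Token counts for BagOfWords, TF-IDF weights for TFIDF |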
Usage Examples
from river import feature_extraction as fx
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# BagOfWords
bow = fx.BagOfWords()
for sentence in corpus:
    print(bow.transform_one(sentence))
# {'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1}
# {'this': 1, 'document': 2, 'is': 1, 'the': 1, 'second': 1}
# {'and': 1, 'this': 1, 'is': 1, 'the': 1, 'third': 1, 'one': 1}
# {'is': 1, 'this': 1, 'the': 1, 'first': 1, 'document': 1}
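# N-grams: unigrams are keyed by string, bigrams by tuple.
# The default tokenizer pattern needs at least two characters, so 'I' is dropped.
ngrammer = fx.BagOfWords(ngram_range=(1, 2))
ngrams = ngrammer.transform_one('I love the smell of napalm in the morning')
for ngram, count in ngrams.items():
    print(ngram, count)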
# love 1
# the 2
# ...
# ('love', 'the') 1
# ('the', 'smell') 1
# ...
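# TFIDF: learn_one updates the document frequencies, transform_one applies them
tfidf = fx.TFIDF()
for sentence in corpus:
    tfidf.learn_one(sentence)
    print(tfidf.transform_one(sentence))
# Approximate values, shown to three decimals: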
# {'this': 0.447, 'is': 0.447, 'the': 0.447, 'first': 0.447, 'document': 0.447}
# {'this': 0.333, 'document': 0.667, 'is': 0.333, 'the': 0.333, 'second': 0.469}
# {'and': 0.497, 'this': 0.293, 'is': 0.293, 'the': 0.293, 'third': 0.497, 'one': 0.497}
# {'is': 0.384, 'this': 0.384, 'the': 0.384, 'first': 0.580, 'document': 0.469}
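The TFIDF weights listed above are consistent with a smoothed inverse document frequency, log((1 + n) / (1 + df)) + 1, applied to relative term frequencies and followed by L2 normalization. The sketch below reconstructs that behaviour in plain Python so the incremental update is visible. It is an assumption-based illustration and not River's source: the class name StreamingTFIDF, the tokenize helper, and the exact formula are hypothetical choices that happen to agree with the numbers shown here.

```python
import collections
import math
import re

# Default token pattern copied from the signature on this page.
TOKEN_PATTERN = re.compile(r"(?u)\b\w[\w\-]+\b")


def tokenize(text):
    # Hypothetical helper: lowercase, then apply the default token pattern.
    return TOKEN_PATTERN.findall(text.lower())


class StreamingTFIDF:
    """Hypothetical stand-in for illustration; not River's TFIDF class."""

    def __init__(self, normalize=True):
        self.normalize = normalize
        self.doc_freqs = collections.Counter()  # documents seen per term
        self.n_docs = 0                         # documents seen so far

    def learn_one(self, text):
        # Each distinct term in the document increments its document frequency.
        for term in set(tokenize(text)):
            self.doc_freqs[term] += 1
        self.n_docs += 1

    def transform_one(self, text):
        counts = collections.Counter(tokenize(text))
        total = sum(counts.values())
        weights = {}
        for term, count in counts.items():
            # Assumed smoothed IDF: rarer terms receive a larger multiplier.
            idf = math.log((1 + self.n_docs) / (1 + self.doc_freqs[term])) + 1
            weights[term] = (count / total) * idf
        if self.normalize and weights:
            # L2 normalization so each document vector has unit length.
            norm = math.sqrt(sum(w * w for w in weights.values()))
            weights = {term: w / norm for term, w in weights.items()}
        return weights


corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

model = StreamingTFIDF()
for sentence in corpus:
    model.learn_one(sentence)
    weights = model.transform_one(sentence)
    print({term: round(w, 3) for term, w in weights.items()})
```

Because the state is only a document counter and a per-term frequency table, each new document costs time proportional to its own length, which is what makes the weighting usable on a stream.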