Implementation:Online ml River FeatureExtraction Vectorize

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Feature_Engineering, Text_Processing
Last Updated	2026-02-08 16:00 GMT

Overview

Streaming text vectorizers that convert raw documents into numeric token features, using either bag-of-words counts or TF-IDF weights.

Description

This module provides streaming text vectorization through two classes. BagOfWords counts token occurrences in a document after preprocessing: accent removal, lowercasing, tokenization, stop-word removal, and n-gram extraction. TFIDF extends BagOfWords by weighting each term by its inverse document frequency, which is updated incrementally as documents arrive. Both classes accept a custom preprocessor, a custom tokenizer or a regex tokenizer pattern, and an n-gram range, and both support mini-batch processing with sparse DataFrames.
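The preprocessing chain described above can be sketched with the standard library alone. This is a minimal illustration, not River's code: the helper names strip_accents and bag_of_words are hypothetical, and approximating accent removal with NFKD decomposition is an assumption. Only the token pattern is taken from the signature on this page.

```python
import re
import unicodedata
from collections import Counter

# Default pattern from the signature on this page: tokens of two or more
# word characters, with hyphens allowed after the first character.
TOKEN_PATTERN = re.compile(r"(?u)\b\w[\w\-]+\b")


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks (assumed approach)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def bag_of_words(text: str, stop_words: frozenset[str] = frozenset()) -> dict[str, int]:
    """Hypothetical helper: strip accents, lowercase, tokenize, filter, count."""
    text = strip_accents(text).lower()
    tokens = (tok for tok in TOKEN_PATTERN.findall(text) if tok not in stop_words)
    return dict(Counter(tokens))


counts = bag_of_words("Café au lait, café noir!", stop_words=frozenset({"au"}))
print(counts)  # {'cafe': 2, 'lait': 1, 'noir': 1}
```

Because the pattern requires at least two characters, single-character tokens such as "I" never reach the counting step.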

Usage

Use BagOfWords when raw term counts are sufficient. Use TFIDF when rare terms should weigh more than common ones, as is typical in text classification and information retrieval. Either vectorizer turns raw text into numeric features that a downstream model can consume. Set ngram_range to capture phrase-level information; for example, ngram_range=(1, 2) emits both single tokens and pairs of adjacent tokens. Use the on parameter to name the dictionary field that contains the text, or leave it unset and pass strings directly.
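To make the effect of an n-gram range concrete, the sketch below counts every n-gram whose length falls inside the range. The function ngram_counts is a hypothetical helper, not part of River; it keeps single tokens as strings and longer n-grams as tuples, which matches the output format shown in the usage examples on this page.

```python
from collections import Counter


def ngram_counts(tokens: list[str], ngram_range: tuple[int, int] = (1, 1)) -> Counter:
    """Count every n-gram with min_n <= n <= max_n (illustrative helper)."""
    min_n, max_n = ngram_range
    counts: Counter = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token list.
        for start in range(len(tokens) - n + 1):
            gram = tokens[start:start + n]
            # Unigrams stay plain strings; longer n-grams become tuples.
            counts[gram[0] if n == 1 else tuple(gram)] += 1
    return counts


tokens = ["love", "the", "smell", "of", "napalm", "in", "the", "morning"]
counts = ngram_counts(tokens, ngram_range=(1, 2))
print(counts["the"])             # 2
print(counts[("love", "the")])   # 1
print(counts[("in", "the")])     # 1
```

Widening the range grows the feature space quickly: these eight tokens yield 7 distinct unigrams and 7 distinct bigrams.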

Code Reference

Source Location

Repository: Online_ml_River
File: river/feature_extraction/vectorize.py

Signature

class BagOfWords(base.Transformer, VectorizerMixin):
    def __init__(
        self,
        on: str | None = None,
        strip_accents=True,
        lowercase=True,
        preprocessor: typing.Callable | None = None,
        stop_words: set[str] | None = None,
        tokenizer_pattern=r"(?u)\b\w[\w\-]+\b",
        tokenizer: typing.Callable | None = None,
        ngram_range=(1, 1),
    )

class TFIDF(BagOfWords):
    def __init__(
        self,
        normalize=True,
        on: str | None = None,
        strip_accents=True,
        lowercase=True,
        preprocessor: typing.Callable | None = None,
        stop_words: set[str] | None = None,
        tokenizer_pattern=r"(?u)\b\w[\w\-]+\b",
        tokenizer: typing.Callable | None = None,
        ngram_range=(1, 1),
    )

Import

from river import feature_extraction

I/O Contract

Input	Output
str or Dict[str, str] - Text document, or a dictionary whose on field holds the text	Dict[str, float] - Token counts (BagOfWords) or TF-IDF weights (TFIDF)

Usage Examples

from river import feature_extraction as fx

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# BagOfWords
bow = fx.BagOfWords()

for sentence in corpus:
    print(bow.transform_one(sentence))
# {'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1}
# {'this': 1, 'document': 2, 'is': 1, 'the': 1, 'second': 1}
# {'and': 1, 'this': 1, 'is': 1, 'the': 1, 'third': 1, 'one': 1}
# {'is': 1, 'this': 1, 'the': 1, 'first': 1, 'document': 1}

# N-grams
ngrammer = fx.BagOfWords(ngram_range=(1, 2))
ngrams = ngrammer.transform_one('I love the smell of napalm in the morning')
for ngram, count in ngrams.items():
    print(ngram, count)
# love 1
# the 2
# ...
# ('love', 'the') 1
# ('the', 'smell') 1
# ...

# TFIDF
tfidf = fx.TFIDF()

for sentence in corpus:
    tfidf.learn_one(sentence)
    print(tfidf.transform_one(sentence))
# Approximate output (weights shown to about three decimal places):
# {'this': 0.447, 'is': 0.447, 'the': 0.447, 'first': 0.447, 'document': 0.447}
# {'this': 0.333, 'document': 0.667, 'is': 0.333, 'the': 0.333, 'second': 0.469}
# {'and': 0.497, 'this': 0.293, 'is': 0.293, 'the': 0.293, 'third': 0.497, 'one': 0.497}
# {'is': 0.384, 'this': 0.384, 'the': 0.384, 'first': 0.580, 'document': 0.469}
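The TFIDF weights above change from one document to the next because the document frequencies are updated before each transform. The sketch below reproduces that behaviour with the standard library only. It is not River's implementation: the class TinyTFIDF is hypothetical, and it assumes term frequency as a share of the document's tokens, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization. Under those assumptions the results agree with the listed weights to within about 0.001.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w[\w\-]+\b")  # default pattern from the signature


class TinyTFIDF:
    """Illustrative stand-in for incremental TF-IDF; not River's code."""

    def __init__(self) -> None:
        self.n_docs = 0       # documents seen so far
        self.dfs = Counter()  # term -> number of documents containing it

    @staticmethod
    def _tokens(text: str) -> list[str]:
        return TOKEN_PATTERN.findall(text.lower())

    def learn_one(self, text: str) -> None:
        # Each term counts once per document, however often it occurs.
        self.dfs.update(set(self._tokens(text)))
        self.n_docs += 1

    def transform_one(self, text: str) -> dict[str, float]:
        counts = Counter(self._tokens(text))
        total = sum(counts.values())
        weights = {
            term: (count / total)
            * (math.log((1 + self.n_docs) / (1 + self.dfs[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm
        return {term: w / norm for term, w in weights.items()}


corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

tfidf = TinyTFIDF()
for sentence in corpus:
    tfidf.learn_one(sentence)
    weights = tfidf.transform_one(sentence)
    print({term: round(w, 3) for term, w in weights.items()})
```

In the last document, "first" receives the largest weight because it has appeared in only two of the four documents, while "is", "this", and "the" have appeared in all of them.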

Related Pages

Environment:Online_ml_River_Python_Runtime_Environment
