Implementation:Online ml River FeatureExtraction Vectorize


Knowledge Sources
Domains: Online_Learning, Feature_Engineering, Text_Processing
Last Updated: 2026-02-08 16:00 GMT

Overview

Text vectorization methods for converting documents to numeric representations using bag-of-words and TF-IDF.

Description

This module provides two streaming text vectorizers. BagOfWords counts token occurrences in each document after preprocessing (accent removal, lowercasing, tokenization, stop word removal, n-gram extraction). TFIDF extends BagOfWords by weighting terms according to their inverse document frequency, which is updated incrementally as documents arrive. Both support custom preprocessing functions, regex-based tokenization, n-gram extraction, and mini-batch processing with sparse DataFrames, so they cover the full path from raw text to numeric features.
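
The preprocessing chain described above can be sketched in plain Python. This is an illustrative approximation, not River's implementation: the helper names strip_accents and bag_of_words are hypothetical, and the step order simply follows the list in the description. Only the tokenizer pattern is taken from the signature on this page.

```python
import re
import unicodedata
from collections import Counter

# Default tokenizer pattern copied from the signature on this page.
TOKEN_PATTERN = re.compile(r"(?u)\b\w[\w\-]+\b")


def strip_accents(text):
    """Drop combining marks after NFKD decomposition (for example 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def bag_of_words(doc, on=None, stop_words=frozenset(), ngram_range=(1, 1)):
    """Hypothetical helper: count tokens and n-grams in a single document."""
    # Pick the text field when `on` is given, otherwise treat doc as a string.
    text = doc[on] if on is not None else doc
    text = strip_accents(text).lower()
    tokens = [t for t in TOKEN_PATTERN.findall(text) if t not in stop_words]

    counts = Counter()
    low, high = ngram_range
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Unigrams are keyed by the bare string, longer n-grams by a tuple.
            counts[gram[0] if n == 1 else tuple(gram)] += 1
    return dict(counts)


print(bag_of_words({"text": "Café au lait, café noir"}, on="text", stop_words={"au"}))
# {'cafe': 2, 'lait': 1, 'noir': 1}
```

Because the pattern requires at least two characters per token, single-character words are never counted in this sketch.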

Usage

Use BagOfWords for simple token counting when term frequency alone is sufficient. Use TFIDF when rare terms should be weighted more heavily than common ones, which is typical in text classification and information retrieval. Both turn raw text into feature dictionaries that downstream machine learning models can consume. Set ngram_range to capture phrase-level information. Use the on parameter to name the dictionary field that contains the text, or leave it unset and pass strings directly.
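
To make the effect of inverse document frequency concrete, here is a minimal streaming sketch. It is not River's code: the class name StreamingTFIDF is hypothetical, and the weighting formula (term frequency times log((1 + n) / (1 + df)) + 1, followed by L2 normalization) is an assumption, chosen because it is consistent with the figures in the Usage Examples section of this page.

```python
import math
import re
from collections import Counter

# Default tokenizer pattern copied from the signature on this page.
TOKEN_PATTERN = re.compile(r"(?u)\b\w[\w\-]+\b")


class StreamingTFIDF:
    """Hypothetical, simplified stand-in for an incremental TF-IDF vectorizer."""

    def __init__(self):
        self.n_docs = 0            # documents seen so far
        self.doc_freq = Counter()  # number of documents containing each term

    @staticmethod
    def _tokens(text):
        return TOKEN_PATTERN.findall(text.lower())

    def learn_one(self, text):
        # Each distinct term counts once per document.
        self.n_docs += 1
        self.doc_freq.update(set(self._tokens(text)))

    def transform_one(self, text):
        counts = Counter(self._tokens(text))
        total = sum(counts.values())
        # Assumed formula: smoothed IDF, so terms seen everywhere get factor 1.
        weights = {
            term: (count / total)
            * (math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {term: w / norm for term, w in weights.items()}


tfidf = StreamingTFIDF()
tfidf.learn_one("This is the first document.")
tfidf.learn_one("This document is the second document.")
weights = tfidf.transform_one("This document is the second document.")
print(weights["second"] > weights["this"])
# True
```

In the second document, "second" and "this" each occur once, but "second" has appeared in only one of the two documents seen so far, so it receives the larger weight. Raw counts alone would treat the two terms identically.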

Code Reference

Source Location

Signature

class BagOfWords(base.Transformer, VectorizerMixin):
    def __init__(
        self,
        on: str | None = None,
        strip_accents=True,
        lowercase=True,
        preprocessor: typing.Callable | None = None,
        stop_words: set[str] | None = None,
        tokenizer_pattern=r"(?u)\b\w[\w\-]+\b",
        tokenizer: typing.Callable | None = None,
        ngram_range=(1, 1),
    )

class TFIDF(BagOfWords):
    def __init__(
        self,
        normalize=True,
        on: str | None = None,
        strip_accents=True,
        lowercase=True,
        preprocessor: typing.Callable | None = None,
        stop_words: set[str] | None = None,
        tokenizer_pattern=r"(?u)\b\w[\w\-]+\b",
        tokenizer: typing.Callable | None = None,
        ngram_range=(1, 1),
    )

Import

from river import feature_extraction

I/O Contract

Input: str or Dict[str, str] - Text document
Output: Dict[str, float] - Token weights

Usage Examples

from river import feature_extraction as fx

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# BagOfWords: raw token counts per document (stateless, so no learn_one call is needed)
bow = fx.BagOfWords()

for sentence in corpus:
    print(bow.transform_one(sentence))
# {'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1}
# {'this': 1, 'document': 2, 'is': 1, 'the': 1, 'second': 1}
# {'and': 1, 'this': 1, 'is': 1, 'the': 1, 'third': 1, 'one': 1}
# {'is': 1, 'this': 1, 'the': 1, 'first': 1, 'document': 1}

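# N-grams: ngram_range=(1, 2) emits unigrams (str keys) and bigrams (tuple keys).
# The one-character token 'I' is dropped because the default tokenizer_pattern
# only matches tokens of two or more characters.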
ngrammer = fx.BagOfWords(ngram_range=(1, 2))
ngrams = ngrammer.transform_one('I love the smell of napalm in the morning')
for ngram, count in ngrams.items():
    print(ngram, count)
# love 1
# the 2
# ...
# ('love', 'the') 1
# ('the', 'smell') 1
# ...

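# TFIDF: learn_one updates the document-frequency statistics, then
# transform_one weights the same sentence (weights shown to three decimal places)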
tfidf = fx.TFIDF()

for sentence in corpus:
    tfidf.learn_one(sentence)
    print(tfidf.transform_one(sentence))
# {'this': 0.447, 'is': 0.447, 'the': 0.447, 'first': 0.447, 'document': 0.447}
# {'this': 0.333, 'document': 0.667, 'is': 0.333, 'the': 0.333, 'second': 0.469}
# {'and': 0.497, 'this': 0.293, 'is': 0.293, 'the': 0.293, 'third': 0.497, 'one': 0.497}
# {'is': 0.384, 'this': 0.384, 'the': 0.384, 'first': 0.580, 'document': 0.469}
