
Principle:Dotnet Machinelearning Text Featurization

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Natural Language Processing, Feature Engineering
Last Updated 2026-02-09 00:00 GMT

Overview

Text featurization converts raw text strings into numeric feature vectors that machine learning algorithms can consume, applying a multi-stage pipeline of normalization, tokenization, stop word removal, and n-gram extraction.

Description

Machine learning algorithms operate on numeric vectors, not raw text. Text featurization bridges this gap through a sequence of transformations that progressively convert unstructured text into a structured numeric representation:

  1. Text normalization standardizes the input by applying case folding (converting to lowercase), removing diacritical marks (accents), stripping punctuation, and optionally removing numbers. This reduces surface-level variation so that "Café", "cafe", and "CAFE" all map to the same token.
  2. Tokenization splits the normalized text into individual words (tokens) using configurable separators. The default splits on whitespace and common punctuation boundaries.
  3. Stop word removal filters out common words that carry no discriminative power for classification tasks (e.g., "the", "is", "and", "of"). ML.NET supports stop word lists for 16 languages, allowing the pipeline to work across multilingual corpora without manual list curation.
  4. N-gram extraction captures word sequences as features. A unigram (n=1) treats each word independently. A bigram (n=2) captures two-word phrases like "very good" or "not bad". Higher-order n-grams capture longer dependencies at the cost of increased dimensionality. N-gram features can be weighted using term frequency (TF), inverse document frequency (IDF), or their product TF-IDF to emphasize discriminative terms.
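Taken together, the four stages can be sketched in a few lines of Python. The helper names and the toy stop word list below are illustrative, not the ML.NET API; they show the data flow each stage implements:

```python
import re
from collections import Counter

# Toy stop word list for illustration only -- ML.NET ships per-language lists.
STOP_WORDS = {"the", "is", "and", "of", "on", "a", "very"}

def normalize(text):
    # Stage 1: case folding and punctuation stripping (diacritic removal omitted).
    text = text.lower()
    return re.sub(r"[^\w\s]", " ", text)

def tokenize(text):
    # Stage 2: split on whitespace.
    return text.split()

def remove_stop_words(tokens):
    # Stage 3: drop tokens with no discriminative power.
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    # Stage 4: slide a window of n consecutive tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(text, n=1):
    tokens = remove_stop_words(tokenize(normalize(text)))
    return Counter(ngrams(tokens, n))

print(featurize("The food was VERY good!"))  # n-gram counts after all four stages
```

Raising `n` to 2 makes the same pipeline emit bigram counts instead of unigram counts, at the cost of a larger feature space.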

The high-level FeaturizeText estimator combines all of these steps into a single operation with sensible defaults. For fine-grained control, each step is also available as an independent estimator that can be composed into a custom pipeline.

Usage

Apply text featurization after data loading and before training. Use the high-level FeaturizeText for rapid prototyping with default settings. Switch to individual estimators (NormalizeText, TokenizeIntoWords, RemoveDefaultStopWords, ProduceNgrams) when you need to customize normalization rules, use language-specific stop words, or tune n-gram parameters for your specific domain.

Theoretical Basis

Bag-of-words model: Text featurization implements the bag-of-words assumption, where a document is represented as an unordered collection of its words. The position of words within the text is discarded; only their presence or frequency is retained.

Document d = "the cat sat on the mat"
Tokens     = ["the", "cat", "sat", "on", "the", "mat"]
BoW(d)     = {the: 2, cat: 1, sat: 1, on: 1, mat: 1}
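The mapping above can be computed directly with a counter over the token list (plain Python, shown here only to make the representation concrete):

```python
from collections import Counter

# Bag-of-words: discard position, keep per-token frequency.
tokens = "the cat sat on the mat".split()
bow = Counter(tokens)
print(bow)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```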

N-gram extension: N-grams extend bag-of-words by considering sequences of n consecutive tokens:

Unigrams (n=1): ["the", "cat", "sat", "on", "the", "mat"]
Bigrams  (n=2): ["the cat", "cat sat", "sat on", "on the", "the mat"]
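A sliding window over the token list produces exactly these sequences; a minimal sketch:

```python
def ngrams(tokens, n):
    # Join each window of n consecutive tokens into a single feature string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
assert ngrams(tokens, 1) == tokens
assert ngrams(tokens, 2) == ["the cat", "cat sat", "sat on", "on the", "the mat"]
```

Note that a document of m tokens yields m - n + 1 n-grams, so higher n shrinks the number of features per document while enlarging the overall vocabulary.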

TF-IDF weighting: Term frequency-inverse document frequency assigns higher weights to terms that are frequent within a document but rare across the corpus:

TF(t, d)      = count(t in d) / |d|
IDF(t, D)     = log(|D| / (1 + count(d in D : t in d)))
TF-IDF(t,d,D) = TF(t, d) * IDF(t, D)

This weighting scheme suppresses ubiquitous terms (like stop words that survive filtering) and amplifies domain-specific discriminative terms.
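The three formulas above translate directly into code. The following sketch uses a toy three-document corpus and illustrative helper names (this is not the ML.NET weighting implementation):

```python
import math

def tf(term, doc):
    # TF(t, d) = count(t in d) / |d|
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF(t, D) = log(|D| / (1 + number of documents containing t))
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + df))

def tf_idf(term, doc, docs):
    # TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
    return tf(term, doc) * idf(term, docs)

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices fell sharply".split(),
]
# "the" appears in 2 of 3 documents: IDF = log(3 / 3) = 0, so its TF-IDF vanishes.
# "stock" appears in 1 of 3: IDF = log(3 / 2) > 0, so it keeps a positive weight.
```

This makes the suppression effect concrete: a ubiquitous term's weight collapses toward zero regardless of how often it occurs in a given document, while a corpus-rare term retains weight proportional to its in-document frequency.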

Stop word filtering: Stop words are the most frequent function words in a language. Removing them reduces feature dimensionality without losing discriminative signal. The framework's 16 pre-built stop word lists cover English, French, German, Spanish, Italian, Dutch, Portuguese, Danish, Swedish, Norwegian, Finnish, Polish, Czech, Russian, Japanese, and Arabic.

Related Pages

Implemented By
