Implementation:Dotnet Machinelearning Text Featurization Pipeline

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Natural Language Processing, .NET
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete tools for converting raw text into numeric feature vectors using ML.NET's text transformation estimators, from the high-level FeaturizeText convenience method to individual normalization, tokenization, stop word removal, and n-gram extraction estimators.

Description

MLContext exposes a suite of text processing estimators through mlContext.Transforms.Text (the Text property of TransformsCatalog):

  • FeaturizeText is a high-level estimator that combines text normalization, tokenization, stop word removal, n-gram extraction, and feature weighting into a single step with sensible defaults. It accepts one or more text input columns and produces a single numeric feature vector output column.
  • NormalizeText performs configurable text normalization including case folding (Lower, Upper, None), diacritical mark removal, punctuation stripping, and number removal. Each option is independently toggleable.
  • TokenizeIntoWords splits normalized text into an array of individual word tokens using configurable separator characters.
  • RemoveDefaultStopWords filters out common words from a tokenized word array using built-in stop word lists. The Language enum supports 16 languages: English, French, German, Dutch, Danish, Swedish, Italian, Spanish, Portuguese, Brazilian Portuguese (Portuguese_Brazilian), Norwegian Bokmål (Norwegian_Bokmal), Russian, Polish, Czech, Arabic, and Japanese.
  • ProduceNgrams extracts n-gram features from a vector of key-typed tokens with configurable n-gram length, skip length, maximum count, and weighting criteria (Tf, Idf, TfIdf). String tokens must first be mapped to keys, for example with Conversion.MapValueToKey.

Each estimator implements IEstimator<TTransformer> and can be chained via Append to form a pipeline.

Usage

Use FeaturizeText for rapid prototyping. When classification accuracy needs improvement, decompose into individual estimators to tune normalization rules, language-specific stop words, n-gram parameters, and weighting schemes independently.
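
Between the one-line default and a fully decomposed pipeline there is a middle step: the FeaturizeText overload that takes a TextFeaturizingEstimator.Options object. The sketch below assumes an input column named Description, as in the examples on this page. The wrapper class FeaturizeWithOptionsSketch is hypothetical, and the option values are illustrative rather than recommended settings.

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class FeaturizeWithOptionsSketch
{
    // Builds a FeaturizeText estimator with a few defaults overridden.
    // "Features" and "Description" are assumed column names.
    public static TextFeaturizingEstimator Build(MLContext mlContext)
    {
        var options = new TextFeaturizingEstimator.Options
        {
            CaseMode = TextNormalizingEstimator.CaseMode.Lower,
            KeepDiacritics = false,
            KeepPunctuations = false,
            KeepNumbers = false,
            // Use the built-in English stop word list.
            StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options
            {
                Language = TextFeaturizingEstimator.Language.English
            },
            // Word unigrams and bigrams.
            WordFeatureExtractor = new WordBagEstimator.Options
            {
                NgramLength = 2,
                UseAllLengths = true
            },
            // Character trigrams only.
            CharFeatureExtractor = new WordBagEstimator.Options
            {
                NgramLength = 3,
                UseAllLengths = false
            }
        };

        return mlContext.Transforms.Text.FeaturizeText(
            "Features", options, "Description");
    }
}
```

If tuning these options is not enough, the same settings map onto the individual estimators described above.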

Code Reference

Source Location

  • Repository: ML.NET
  • File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L36-40 (FeaturizeText)
  • File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L134-142 (NormalizeText)
  • File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L229-233 (TokenizeIntoWords)
  • File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L307-311 (RemoveDefaultStopWords)
  • File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L268-277 (ProduceNgrams)
  • File: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:L67-786 (FeaturizeText internals)

Signature

// High-level: all-in-one text featurization
public TextFeaturizingEstimator FeaturizeText(
    string outputColumnName,
    string inputColumnName = null)

// Step 1: Normalize text
public TextNormalizingEstimator NormalizeText(
    string outputColumnName,
    string inputColumnName = null,
    TextNormalizingEstimator.CaseMode caseMode
        = TextNormalizingEstimator.CaseMode.Lower,
    bool keepDiacritics = false,
    bool keepPunctuations = true,
    bool keepNumbers = true)

// Step 2: Tokenize into words
public WordTokenizingEstimator TokenizeIntoWords(
    string outputColumnName,
    string inputColumnName = null,
    char[] separators = null)

// Step 3: Remove stop words
public StopWordsRemovingEstimator RemoveDefaultStopWords(
    string outputColumnName,
    string inputColumnName = null,
    StopWordsRemovingEstimator.Language language
        = StopWordsRemovingEstimator.Language.English)

// Step 4: Extract n-grams
public NgramExtractingEstimator ProduceNgrams(
    string outputColumnName,
    string inputColumnName = null,
    int ngramLength = 2,
    int skipLength = 0,
    bool useAllLengths = true,
    int maximumNgramsCount = 10000000,
    NgramExtractingEstimator.WeightingCriteria weighting
        = NgramExtractingEstimator.WeightingCriteria.Tf)

Import

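using Microsoft.ML;
using Microsoft.ML.Transforms.Text; // TextNormalizingEstimator, StopWordsRemovingEstimator, NgramExtractingEstimator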

I/O Contract

Inputs

Parameters shared by all estimators:

Name | Type | Required | Description
outputColumnName | string | Yes | Name of the output column. For FeaturizeText and ProduceNgrams it holds the numeric feature vector; for the other estimators it holds the transformed text or tokens.
inputColumnName | string | No | Name of the input column. Defaults to outputColumnName if null.

NormalizeText additional parameters:

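Name | Type | Required | Default | Description
caseMode | CaseMode | No | Lower | Case transformation: Lower, Upper, or None.
keepDiacritics | bool | No | false | Whether to preserve diacritical marks (accents).
keepPunctuations | bool | No | true | Whether to preserve punctuation characters.
keepNumbers | bool | No | true | Whether to preserve numeric characters.

TokenizeIntoWords additional parameters:

Name | Type | Required | Default | Description
separators | char[] | No | null | Characters to split on. A single space is used when null.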

RemoveDefaultStopWords additional parameters:

Name | Type | Required | Default | Description
language | Language | No | English | Language for the stop word list. Supports 16 languages.

ProduceNgrams additional parameters:

Name | Type | Required | Default | Description
ngramLength | int | No | 2 | Maximum n-gram length to extract (the only length extracted when useAllLengths is false).
skipLength | int | No | 0 | Number of tokens to skip between n-gram components.
useAllLengths | bool | No | true | Whether to include all n-gram lengths from 1 to ngramLength.
maximumNgramsCount | int | No | 10000000 | Maximum number of n-grams to retain.
weighting | WeightingCriteria | No | Tf | Feature weighting scheme: Tf, Idf, or TfIdf.
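
To make ngramLength and useAllLengths concrete, here is a conceptual sketch in plain C# that enumerates contiguous n-grams from a token list. It is not ML.NET's implementation: the helper NgramSketch.Enumerate is hypothetical, and skipLength, counting, and weighting are deliberately left out. It only shows which token sequences each setting selects.

```csharp
using System;
using System.Collections.Generic;

public static class NgramSketch
{
    // Enumerates contiguous n-grams. When useAllLengths is true, every
    // length from 1 to ngramLength is produced; otherwise only ngramLength.
    public static List<string> Enumerate(
        IReadOnlyList<string> tokens, int ngramLength, bool useAllLengths)
    {
        var result = new List<string>();
        int minLength = useAllLengths ? 1 : ngramLength;
        for (int n = minLength; n <= ngramLength; n++)
        {
            for (int start = 0; start + n <= tokens.Count; start++)
            {
                var parts = new string[n];
                for (int i = 0; i < n; i++)
                    parts[i] = tokens[start + i];
                result.Add(string.Join(" ", parts));
            }
        }
        return result;
    }

    public static void Main()
    {
        var tokens = new[] { "build", "fails", "linux" };
        // Prints: build, fails, linux, build fails, fails linux
        Console.WriteLine(string.Join(", ", Enumerate(tokens, 2, true)));
        // Prints: build fails, fails linux
        Console.WriteLine(string.Join(", ", Enumerate(tokens, 2, false)));
    }
}
```

With useAllLengths set to false the unigrams disappear, which shrinks the feature space but drops single-word signals.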

Outputs

FeaturizeText:

Name | Type | Description
(return) | TextFeaturizingEstimator | Estimator that produces a single numeric feature vector column from raw text.

Individual estimators:

Estimator | Return Type | Output Column Type
NormalizeText | TextNormalizingEstimator | Normalized string.
TokenizeIntoWords | WordTokenizingEstimator | Array of word tokens (string[]).
RemoveDefaultStopWords | StopWordsRemovingEstimator | Filtered array of word tokens (string[]).
ProduceNgrams | NgramExtractingEstimator | Numeric feature vector (float[]).

Usage Examples

High-Level FeaturizeText

using Microsoft.ML;

var mlContext = new MLContext(seed: 42);

// Load data
IDataView dataView = mlContext.Data.LoadFromTextFile<IssueData>(
    "issues.csv", separatorChar: ',', hasHeader: true);

// One-step text featurization with defaults
var pipeline = mlContext.Transforms.Text.FeaturizeText(
    outputColumnName: "Features",
    inputColumnName: "Description");

// Fit and transform
ITransformer transformer = pipeline.Fit(dataView);
IDataView transformedData = transformer.Transform(dataView);
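
The example above references an IssueData type that this page does not define. A minimal sketch follows, assuming issues.csv has an area label in its first column and the issue text in its second. The Area property, the column order, and the FeaturizedIssue output class are all hypothetical. The sketch loads rows from memory instead of a file so that it runs without issues.csv, then reads back the length of the Features vector.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input schema; adjust the LoadColumn indices to the real file.
public class IssueData
{
    [LoadColumn(0)] public string Area { get; set; }
    [LoadColumn(1)] public string Description { get; set; }
}

// Hypothetical output row used only to read the feature vector back.
public class FeaturizedIssue
{
    public float[] Features { get; set; }
}

public static class FeaturizeTextSketch
{
    public static int FeatureVectorLength(MLContext mlContext)
    {
        var rows = new[]
        {
            new IssueData { Area = "build", Description = "Build fails on Linux" },
            new IssueData { Area = "docs", Description = "Typo in the README file" }
        };
        IDataView dataView = mlContext.Data.LoadFromEnumerable(rows);

        var pipeline = mlContext.Transforms.Text.FeaturizeText(
            outputColumnName: "Features", inputColumnName: "Description");
        IDataView transformed = pipeline.Fit(dataView).Transform(dataView);

        // Materialize the first row to inspect its feature vector.
        var first = mlContext.Data
            .CreateEnumerable<FeaturizedIssue>(transformed, reuseRowObject: false)
            .First();
        return first.Features.Length;
    }

    public static void Main()
    {
        var mlContext = new MLContext(seed: 42);
        Console.WriteLine($"Feature vector length: {FeatureVectorLength(mlContext)}");
    }
}
```

The vector length depends on the n-grams seen during Fit, so it changes with the training data.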

Custom Step-by-Step Pipeline

using Microsoft.ML;
using Microsoft.ML.Transforms.Text; // CaseMode, Language, and WeightingCriteria live here

var mlContext = new MLContext(seed: 42);

IDataView dataView = mlContext.Data.LoadFromTextFile<IssueData>(
    "issues.csv", separatorChar: ',', hasHeader: true);

// Build a custom text processing pipeline
var pipeline = mlContext.Transforms.Text.NormalizeText(
        outputColumnName: "NormalizedText",
        inputColumnName: "Description",
        caseMode: TextNormalizingEstimator.CaseMode.Lower,
        keepDiacritics: false,
        keepPunctuations: false,
        keepNumbers: false)
    .Append(mlContext.Transforms.Text.TokenizeIntoWords(
        outputColumnName: "Tokens",
        inputColumnName: "NormalizedText"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords(
        outputColumnName: "FilteredTokens",
        inputColumnName: "Tokens",
        language: StopWordsRemovingEstimator.Language.English))
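    // ProduceNgrams expects a vector of key-typed values, so map tokens to keys first
    .Append(mlContext.Transforms.Conversion.MapValueToKey(
        outputColumnName: "TokenKeys",
        inputColumnName: "FilteredTokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams(
        outputColumnName: "Features",
        inputColumnName: "TokenKeys",
        ngramLength: 2,
        useAllLengths: true,
        weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf));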

ITransformer transformer = pipeline.Fit(dataView);
IDataView transformedData = transformer.Transform(dataView);
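
One benefit of the decomposed pipeline is that each intermediate column can be inspected. The sketch below runs only the normalization, tokenization, and stop word stages on a single in-memory row and reads back the resulting columns. The TextRow and TokenRow classes are hypothetical. Which tokens are removed depends on ML.NET's built-in English stop word list, so no expected output is shown.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

// Hypothetical input row.
public class TextRow
{
    public string Description { get; set; }
}

// Hypothetical output row exposing the intermediate columns.
public class TokenRow
{
    public string NormalizedText { get; set; }
    public string[] Tokens { get; set; }
    public string[] FilteredTokens { get; set; }
}

public static class TokenInspectionSketch
{
    public static TokenRow Run(MLContext mlContext, string text)
    {
        IDataView data = mlContext.Data.LoadFromEnumerable(
            new[] { new TextRow { Description = text } });

        var pipeline = mlContext.Transforms.Text.NormalizeText(
                "NormalizedText", "Description",
                caseMode: TextNormalizingEstimator.CaseMode.Lower,
                keepPunctuations: false)
            .Append(mlContext.Transforms.Text.TokenizeIntoWords(
                "Tokens", "NormalizedText"))
            .Append(mlContext.Transforms.Text.RemoveDefaultStopWords(
                "FilteredTokens", "Tokens"));

        IDataView transformed = pipeline.Fit(data).Transform(data);
        return mlContext.Data
            .CreateEnumerable<TokenRow>(transformed, reuseRowObject: false)
            .First();
    }

    public static void Main()
    {
        var row = Run(new MLContext(seed: 42), "The build fails on Linux!");
        Console.WriteLine(row.NormalizedText);
        Console.WriteLine(string.Join(" | ", row.Tokens));
        Console.WriteLine(string.Join(" | ", row.FilteredTokens));
    }
}
```

Comparing Tokens with FilteredTokens is a quick way to check whether the chosen stop word language suits the data before n-gram extraction.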
