Implementation:Dotnet Machinelearning Text Featurization Pipeline

Knowledge Sources	ML.NET, ML.NET API Reference
Domains	Machine Learning, Natural Language Processing, .NET
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tools for converting raw text into numeric feature vectors using ML.NET's text transformation estimators, from the high-level FeaturizeText convenience method to individual normalization, tokenization, stop word removal, and n-gram extraction estimators.

Description

The Text property of TransformsCatalog, reached as mlContext.Transforms.Text, exposes a suite of text processing estimators as extension methods:

FeaturizeText is a high-level estimator that combines text normalization, tokenization, optional stop word removal, n-gram extraction, and feature weighting into a single step with sensible defaults. The two-argument overload reads one text column; an overload that takes TextFeaturizingEstimator.Options accepts several input columns. In both cases the result is a single numeric feature vector output column.

NormalizeText performs configurable text normalization including case folding (Lower, Upper, None), diacritical mark removal, punctuation stripping, and number removal. Each option is independently toggleable.

TokenizeIntoWords splits normalized text into an array of individual word tokens using configurable separator characters.

RemoveDefaultStopWords filters out common words from a tokenized word array using built-in stop word lists. The Language enum has 16 members: English, French, German, Dutch, Danish, Swedish, Italian, Spanish, Portuguese, Portuguese_Brazilian, Norwegian_Bokmal, Russian, Polish, Czech, Arabic, and Japanese.

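ProduceNgrams extracts n-gram features from a vector of key-typed tokens with configurable n-gram length, skip length, maximum count, and weighting criteria (Tf, Idf, TfIdf). Because it expects keys rather than strings, word tokens must first be mapped with Conversion.MapValueToKey.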

Each estimator implements IEstimator<ITransformer> and can be chained via Append to form a pipeline.
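The first three stages can be inspected on a single string, which helps when deciding which normalization options to keep. The following is a minimal sketch, not a reference implementation: the Input and Output row classes, the TextStepsSketch wrapper, and the column names are hypothetical and exist only for this illustration. It assumes the Microsoft.ML NuGet package is referenced.

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class TextStepsSketch
{
    // Hypothetical row shapes, introduced only for this sketch.
    public class Input
    {
        public string Text { get; set; }
    }

    public class Output
    {
        public string NormalizedText { get; set; }
        public string[] Tokens { get; set; }
        public string[] FilteredTokens { get; set; }
    }

    public static Output Run(string text)
    {
        var mlContext = new MLContext(seed: 42);

        // None of these stages learns from data, so an empty view is enough to fit.
        IDataView empty = mlContext.Data.LoadFromEnumerable(new List<Input>());

        var pipeline = mlContext.Transforms.Text.NormalizeText(
                outputColumnName: "NormalizedText",
                inputColumnName: "Text",
                caseMode: TextNormalizingEstimator.CaseMode.Lower,
                keepDiacritics: false,
                keepPunctuations: false,
                keepNumbers: true)
            .Append(mlContext.Transforms.Text.TokenizeIntoWords(
                outputColumnName: "Tokens",
                inputColumnName: "NormalizedText",
                separators: new[] { ' ' }))
            .Append(mlContext.Transforms.Text.RemoveDefaultStopWords(
                outputColumnName: "FilteredTokens",
                inputColumnName: "Tokens",
                language: StopWordsRemovingEstimator.Language.English));

        ITransformer model = pipeline.Fit(empty);

        // A prediction engine maps one Input row to one Output row,
        // exposing every intermediate column declared on Output.
        var engine = mlContext.Model.CreatePredictionEngine<Input, Output>(model);
        return engine.Predict(new Input { Text = text });
    }
}
```

Calling TextStepsSketch.Run with a short sentence returns the normalized string, the raw tokens, and the tokens that survive stop word filtering, so each stage's effect can be compared side by side.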

Usage

Use FeaturizeText for rapid prototyping. When classification accuracy needs improvement, decompose into individual estimators to tune normalization rules, language-specific stop words, n-gram parameters, and weighting schemes independently.
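Between the default FeaturizeText call and a fully decomposed pipeline there is a middle path: the FeaturizeText overload that takes TextFeaturizingEstimator.Options. The sketch below shows one possible configuration under stated assumptions. The Issue and Featurized classes, the FeaturizeOptionsSketch wrapper, the sample rows, and the chosen option values are all hypothetical and not taken from the source; the Microsoft.ML NuGet package is assumed to be referenced.

```csharp
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class FeaturizeOptionsSketch
{
    // Hypothetical row shapes, introduced only for this sketch.
    public class Issue
    {
        public string Title { get; set; }
        public string Description { get; set; }
    }

    public class Featurized
    {
        public float[] Features { get; set; }
    }

    public static int FeatureCount()
    {
        var mlContext = new MLContext(seed: 42);

        // Made-up rows so the sketch runs without an external file.
        var rows = new[]
        {
            new Issue { Title = "Crash on startup", Description = "The application crashes when the config file is missing." },
            new Issue { Title = "Slow export", Description = "Exporting large reports takes several minutes." },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        var options = new TextFeaturizingEstimator.Options
        {
            KeepPunctuations = false,
            KeepNumbers = false,
            // Stop word removal is enabled here by supplying remover options.
            StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options
            {
                Language = TextFeaturizingEstimator.Language.English
            },
            // Word unigrams and bigrams, plus character trigrams.
            WordFeatureExtractor = new WordBagEstimator.Options { NgramLength = 2, UseAllLengths = true },
            CharFeatureExtractor = new WordBagEstimator.Options { NgramLength = 3, UseAllLengths = false },
        };

        // Both text columns are featurized into one output vector.
        var pipeline = mlContext.Transforms.Text.FeaturizeText(
            "Features", options, "Title", "Description");

        IDataView transformed = pipeline.Fit(data).Transform(data);

        // Read back the first row to see how wide the feature vector is.
        return mlContext.Data
            .CreateEnumerable<Featurized>(transformed, reuseRowObject: false)
            .First()
            .Features.Length;
    }
}
```

This keeps the single-estimator convenience while still exposing normalization, stop word, and n-gram settings. Full decomposition remains the better fit when intermediate columns need to be inspected or reused.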

Code Reference

Source Location

Repository: ML.NET
File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L36-40 (FeaturizeText)
File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L134-142 (NormalizeText)
File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L229-233 (TokenizeIntoWords)
File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L307-311 (RemoveDefaultStopWords)
File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L268-277 (ProduceNgrams)
File: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:L67-786 (FeaturizeText internals)

Signature

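// Signatures abbreviated: in the source each is a static extension method
// on TransformsCatalog.TextTransforms (the leading `this` parameter is omitted).
// High-level: all-in-one text featurization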
public TextFeaturizingEstimator FeaturizeText(
    string outputColumnName,
    string inputColumnName = null)

// Step 1: Normalize text
public TextNormalizingEstimator NormalizeText(
    string outputColumnName,
    string inputColumnName = null,
    TextNormalizingEstimator.CaseMode caseMode
        = TextNormalizingEstimator.CaseMode.Lower,
    bool keepDiacritics = false,
    bool keepPunctuations = true,
    bool keepNumbers = true)

// Step 2: Tokenize into words
public WordTokenizingEstimator TokenizeIntoWords(
    string outputColumnName,
    string inputColumnName = null,
    char[] separators = null)

// Step 3: Remove stop words
public StopWordsRemovingEstimator RemoveDefaultStopWords(
    string outputColumnName,
    string inputColumnName = null,
    StopWordsRemovingEstimator.Language language
        = StopWordsRemovingEstimator.Language.English)

// Step 4: Extract n-grams
public NgramExtractingEstimator ProduceNgrams(
    string outputColumnName,
    string inputColumnName = null,
    int ngramLength = 2,
    int skipLength = 0,
    bool useAllLengths = true,
    int maximumNgramsCount = 10000000,
    NgramExtractingEstimator.WeightingCriteria weighting
        = NgramExtractingEstimator.WeightingCriteria.Tf)

Import

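using Microsoft.ML;
using Microsoft.ML.Transforms.Text; // estimator types and their nested enums (CaseMode, Language, WeightingCriteria)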

I/O Contract

Inputs

Common parameters (all estimators):

Name	Type	Required	Description
outputColumnName	string	Yes	Name of the column the estimator produces. For FeaturizeText and ProduceNgrams this column holds the numeric feature vector.
inputColumnName	string	No	Name of the input column. Defaults to outputColumnName if null.

NormalizeText additional parameters:

Name	Type	Required	Default	Description
caseMode	CaseMode	No	Lower	Case transformation: Lower, Upper, or None.
keepDiacritics	bool	No	false	Whether to preserve diacritical marks (accents).
keepPunctuations	bool	No	true	Whether to preserve punctuation characters.
keepNumbers	bool	No	true	Whether to preserve numeric characters.

RemoveDefaultStopWords additional parameters:

Name	Type	Required	Default	Description
language	Language	No	English	Language for the stop word list. Supports 16 languages.

ProduceNgrams additional parameters:

Name	Type	Required	Default	Description
ngramLength	int	No	2	Maximum n-gram length to extract.
skipLength	int	No	0	Maximum number of tokens to skip when constructing an n-gram.
useAllLengths	bool	No	true	Whether to include all n-gram lengths from 1 to ngramLength.
maximumNgramsCount	int	No	10000000	Maximum number of n-grams to retain.
weighting	WeightingCriteria	No	Tf	Feature weighting scheme: Tf, Idf, or TfIdf.
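To make the interaction between ngramLength and useAllLengths concrete, the following plain C# sketch enumerates the n-grams those two settings select from a token list. It illustrates the parameter semantics only and is not ML.NET's implementation: the NgramSketch class, the Enumerate method, and the "|" separator are hypothetical, the ordering of results is arbitrary, and skipLength, maximumNgramsCount, and weighting are deliberately left out.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class NgramSketch
{
    // Lists contiguous n-grams of the requested lengths, joined with "|".
    public static List<string> Enumerate(
        IReadOnlyList<string> tokens, int ngramLength, bool useAllLengths)
    {
        var result = new List<string>();

        // useAllLengths: include every length from 1 up to ngramLength;
        // otherwise only n-grams of exactly ngramLength tokens.
        int minLength = useAllLengths ? 1 : ngramLength;

        for (int n = minLength; n <= ngramLength; n++)
        {
            for (int start = 0; start + n <= tokens.Count; start++)
            {
                result.Add(string.Join("|", tokens.Skip(start).Take(n)));
            }
        }

        return result;
    }
}
```

For the tokens new, york, city with ngramLength 2, the sketch yields five entries when useAllLengths is true (three unigrams and two bigrams) and two entries when it is false (the bigrams only).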

Outputs

FeaturizeText:

Name	Type	Description
(return)	TextFeaturizingEstimator	Estimator that produces a single numeric feature vector column from raw text.

Individual estimators:

Estimator	Return Type	Output Column Type
NormalizeText	TextNormalizingEstimator	Normalized string.
TokenizeIntoWords	WordTokenizingEstimator	Array of word tokens (string[]).
RemoveDefaultStopWords	StopWordsRemovingEstimator	Filtered array of word tokens (string[]).
ProduceNgrams	NgramExtractingEstimator	Numeric feature vector (float[]).

Usage Examples

High-Level FeaturizeText

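using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 42);

// Load data; IssueData maps CSV columns to properties (see the class below)
IDataView dataView = mlContext.Data.LoadFromTextFile<IssueData>(
    "issues.csv", separatorChar: ',', hasHeader: true);

// One-step text featurization with defaults
var pipeline = mlContext.Transforms.Text.FeaturizeText(
    outputColumnName: "Features",
    inputColumnName: "Description");

// Fit and transform
ITransformer transformer = pipeline.Fit(dataView);
IDataView transformedData = transformer.Transform(dataView);

// IssueData is not defined in the source; this shape and its column index are assumptions.
public class IssueData
{
    [LoadColumn(0)]
    public string Description { get; set; }
}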

Custom Step-by-Step Pipeline

using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext(seed: 42);

// IssueData is assumed to be a row class exposing a [LoadColumn]-annotated
// string Description property.
IDataView dataView = mlContext.Data.LoadFromTextFile<IssueData>(
    "issues.csv", separatorChar: ',', hasHeader: true);

// Build a custom text processing pipeline
var pipeline = mlContext.Transforms.Text.NormalizeText(
        outputColumnName: "NormalizedText",
        inputColumnName: "Description",
        caseMode: TextNormalizingEstimator.CaseMode.Lower,
        keepDiacritics: false,
        keepPunctuations: false,
        keepNumbers: false)
    .Append(mlContext.Transforms.Text.TokenizeIntoWords(
        outputColumnName: "Tokens",
        inputColumnName: "NormalizedText"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords(
        outputColumnName: "FilteredTokens",
        inputColumnName: "Tokens",
        language: StopWordsRemovingEstimator.Language.English))
    // ProduceNgrams expects a vector of keys, so map each token string to a key first
    .Append(mlContext.Transforms.Conversion.MapValueToKey(
        outputColumnName: "TokenKeys",
        inputColumnName: "FilteredTokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams(
        outputColumnName: "Features",
        inputColumnName: "TokenKeys",
        ngramLength: 2,
        useAllLengths: true,
        weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf));

// Fit and transform
ITransformer transformer = pipeline.Fit(dataView);
IDataView transformedData = transformer.Transform(dataView);

Related Pages

Implements Principle

Principle:Dotnet_Machinelearning_Text_Featurization

Requires Environment

Uses Heuristic
