Implementation:Dotnet Machinelearning Text Featurization Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, .NET |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tools for converting raw text into numeric feature vectors using ML.NET's text transformation estimators, from the high-level FeaturizeText convenience method to individual normalization, tokenization, stop word removal, and n-gram extraction estimators.
Description
The MLContext.Transforms.Text catalog (of type TransformsCatalog.TextTransforms) exposes a suite of text processing estimators as extension methods:
- FeaturizeText is a high-level estimator that combines text normalization, tokenization, optional stop word removal, n-gram extraction, and feature weighting into a single step. It accepts one or more text input columns and produces a single numeric feature vector output column. With the default settings stop words are not removed; removal is enabled through TextFeaturizingEstimator.Options.
- NormalizeText performs configurable text normalization including case folding (Lower, Upper, None), diacritical mark removal, punctuation stripping, and number removal. Each option is independently toggleable.
- TokenizeIntoWords splits normalized text into an array of individual word tokens using configurable separator characters.
- RemoveDefaultStopWords filters out common words from a tokenized word array using built-in stop word lists. The Language enum defines 16 values: English, French, German, Dutch, Danish, Swedish, Italian, Spanish, Portuguese, Portuguese_Brazilian, Norwegian_Bokmal, Russian, Polish, Czech, Arabic, and Japanese.
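- ProduceNgrams extracts n-gram features from a vector of key-typed tokens (for example, word tokens converted with Conversion.MapValueToKey), with configurable n-gram length, skip length, maximum count, and weighting criteria (Tf, Idf, TfIdf). It does not accept a raw string token array directly.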
Each method returns an estimator that implements IEstimator<TTransformer> for its own transformer type, so the steps can be chained via Append to form a pipeline.
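To make the chaining concrete, the following is a minimal sketch that runs the first two steps on in-memory data and reads the intermediate columns back. The Issue and Processed classes and the sample sentence are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 42);

// Hypothetical in-memory sample standing in for a real dataset.
var samples = new List<Issue>
{
    new Issue { Description = "The App CRASHES on startup" },
};
IDataView data = mlContext.Data.LoadFromEnumerable(samples);

// Chain two estimators: lower-case the text, then split it on spaces.
var pipeline = mlContext.Transforms.Text.NormalizeText(
        "Normalized", "Description", keepPunctuations: false)
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Normalized"));

IDataView transformed = pipeline.Fit(data).Transform(data);

// Read the intermediate columns back into plain objects.
var rows = mlContext.Data
    .CreateEnumerable<Processed>(transformed, reuseRowObject: false)
    .ToList();

foreach (var row in rows)
{
    Console.WriteLine(row.Normalized);
    Console.WriteLine(string.Join(" | ", row.Tokens));
}

// Illustrative input type (not part of ML.NET).
public class Issue
{
    public string Description { get; set; } = "";
}

// Illustrative output type; property names must match the pipeline's column names.
public class Processed
{
    public string Normalized { get; set; } = "";
    public string[] Tokens { get; set; } = Array.Empty<string>();
}
```

Reading intermediate columns this way is useful when tuning normalization flags, because the effect of each option is visible before any n-gram extraction takes place.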
Usage
Use FeaturizeText for rapid prototyping. When classification accuracy needs improvement, decompose into individual estimators to tune normalization rules, language-specific stop words, n-gram parameters, and weighting schemes independently.
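Before decomposing into individual estimators, an intermediate option is to keep FeaturizeText and pass it a TextFeaturizingEstimator.Options object. The sketch below assumes a small in-memory dataset; the Issue class and the chosen option values are illustrative, not recommended settings.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext(seed: 42);

// Hypothetical sample data for illustration only.
var samples = new List<Issue>
{
    new Issue { Description = "App crashes when opening the settings page" },
    new Issue { Description = "Login page times out after submitting the form" },
};
IDataView data = mlContext.Data.LoadFromEnumerable(samples);

var options = new TextFeaturizingEstimator.Options
{
    CaseMode = TextNormalizingEstimator.CaseMode.Lower,
    KeepPunctuations = false,
    // Enable the built-in English stop word list.
    StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options
    {
        Language = TextFeaturizingEstimator.Language.English
    },
    // Word unigrams and bigrams.
    WordFeatureExtractor = new WordBagEstimator.Options
    {
        NgramLength = 2,
        UseAllLengths = true
    },
    // Character trigrams only.
    CharFeatureExtractor = new WordBagEstimator.Options
    {
        NgramLength = 3,
        UseAllLengths = false
    },
};

var pipeline = mlContext.Transforms.Text.FeaturizeText(
    "Features", options, "Description");

IDataView transformed = pipeline.Fit(data).Transform(data);
Console.WriteLine(transformed.Schema["Features"].Type);

// Illustrative input type (not part of ML.NET).
public class Issue
{
    public string Description { get; set; } = "";
}
```

This keeps the pipeline to a single estimator while still exposing the normalization, stop word, and n-gram settings; the fully decomposed pipeline is mainly needed when a step must be replaced or its intermediate output inspected.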
Code Reference
Source Location
- Repository: ML.NET
- File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L36-40 (FeaturizeText)
- File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L134-142 (NormalizeText)
- File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L229-233 (TokenizeIntoWords)
- File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L307-311 (RemoveDefaultStopWords)
- File: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:L268-277 (ProduceNgrams)
- File: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:L67-786 (FeaturizeText internals)
Signature
// Extension methods on TransformsCatalog.TextTransforms; the leading
// "this TransformsCatalog.TextTransforms catalog" parameter is omitted for brevity.

// High-level: all-in-one text featurization
public TextFeaturizingEstimator FeaturizeText(
    string outputColumnName,
    string inputColumnName = null)

// Step 1: Normalize text
public TextNormalizingEstimator NormalizeText(
    string outputColumnName,
    string inputColumnName = null,
    TextNormalizingEstimator.CaseMode caseMode
        = TextNormalizingEstimator.CaseMode.Lower,
    bool keepDiacritics = false,
    bool keepPunctuations = true,
    bool keepNumbers = true)

// Step 2: Tokenize into words
public WordTokenizingEstimator TokenizeIntoWords(
    string outputColumnName,
    string inputColumnName = null,
    char[] separators = null)

// Step 3: Remove stop words
public StopWordsRemovingEstimator RemoveDefaultStopWords(
    string outputColumnName,
    string inputColumnName = null,
    StopWordsRemovingEstimator.Language language
        = StopWordsRemovingEstimator.Language.English)

// Step 4: Extract n-grams (input must be a vector of keys, e.g. from MapValueToKey)
public NgramExtractingEstimator ProduceNgrams(
    string outputColumnName,
    string inputColumnName = null,
    int ngramLength = 2,
    int skipLength = 0,
    bool useAllLengths = true,
    int maximumNgramsCount = 10000000,
    NgramExtractingEstimator.WeightingCriteria weighting
        = NgramExtractingEstimator.WeightingCriteria.Tf)
Import
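using Microsoft.ML;
using Microsoft.ML.Transforms.Text; // TextNormalizingEstimator, StopWordsRemovingEstimator, NgramExtractingEstimator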
I/O Contract
Inputs
Common parameters (all estimators):
| Name | Type | Required | Description |
|---|---|---|---|
| outputColumnName | string | Yes | Name of the output column. For FeaturizeText and ProduceNgrams this column holds the numeric feature vector; for the other estimators it holds the transformed text or tokens. |
| inputColumnName | string | No | Name of the input column. Defaults to outputColumnName if null. |
NormalizeText additional parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| caseMode | CaseMode | No | Lower | Case transformation: Lower, Upper, or None. |
| keepDiacritics | bool | No | false | Whether to preserve diacritical marks (accents). |
| keepPunctuations | bool | No | true | Whether to preserve punctuation characters. |
| keepNumbers | bool | No | true | Whether to preserve numeric characters. |
TokenizeIntoWords additional parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| separators | char[] | No | null | Characters used to split the text into tokens. When null, the space character is used. |
RemoveDefaultStopWords additional parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| language | Language | No | English | Language for the stop word list. Supports 16 languages. |
ProduceNgrams additional parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| ngramLength | int | No | 2 | Maximum n-gram length to extract. |
| skipLength | int | No | 0 | Maximum number of tokens to skip when constructing an n-gram. |
| useAllLengths | bool | No | true | Whether to include all n-gram lengths from 1 to ngramLength. |
| maximumNgramsCount | int | No | 10000000 | Maximum number of n-grams to retain. |
| weighting | WeightingCriteria | No | Tf | Feature weighting scheme: Tf, Idf, or TfIdf. |
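The interaction between ngramLength and useAllLengths can be illustrated with a small plain C# sketch. This is a conceptual model only, not ML.NET's implementation: the WordNgrams helper is hypothetical, it ignores skipLength, maximumNgramsCount, and weighting, and it returns n-grams as strings rather than as slots of a feature vector.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: enumerates contiguous word n-grams.
// With useAllLengths the lengths 1..ngramLength are produced,
// otherwise only n-grams of exactly ngramLength tokens.
static List<string> WordNgrams(string[] tokens, int ngramLength, bool useAllLengths)
{
    var result = new List<string>();
    int minLength = useAllLengths ? 1 : ngramLength;
    for (int n = minLength; n <= ngramLength; n++)
    {
        for (int start = 0; start + n <= tokens.Length; start++)
        {
            result.Add(string.Join(" ", tokens, start, n));
        }
    }
    return result;
}

var tokens = new[] { "app", "crashes", "on", "startup" };

var all = WordNgrams(tokens, ngramLength: 2, useAllLengths: true);
var bigramsOnly = WordNgrams(tokens, ngramLength: 2, useAllLengths: false);

Console.WriteLine(string.Join(", ", all));
// prints: app, crashes, on, startup, app crashes, crashes on, on startup
Console.WriteLine(string.Join(", ", bigramsOnly));
// prints: app crashes, crashes on, on startup
```

In the sketch, setting useAllLengths to false shrinks the vocabulary from seven entries to three, which mirrors why that flag has a large effect on the size of the resulting feature vector.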
Outputs
FeaturizeText:
| Name | Type | Description |
|---|---|---|
| (return) | TextFeaturizingEstimator | Estimator that produces a single numeric feature vector column from raw text. |
Individual estimators:
| Estimator | Return Type | Output Column Type |
|---|---|---|
| NormalizeText | TextNormalizingEstimator | Normalized string. |
| TokenizeIntoWords | WordTokenizingEstimator | Array of word tokens (string[]). |
| RemoveDefaultStopWords | StopWordsRemovingEstimator | Filtered array of word tokens (string[]). |
| ProduceNgrams | NgramExtractingEstimator | Numeric feature vector (float[]). |
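The column types in the table can be checked at runtime by printing the schema of the transformed data. The sketch below assumes a tiny in-memory dataset and a hypothetical Issue class; it also includes the MapValueToKey step, since ProduceNgrams expects key-typed tokens rather than strings.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

var mlContext = new MLContext(seed: 42);

// Hypothetical sample data for illustration only.
var samples = new List<Issue>
{
    new Issue { Description = "login page times out" },
    new Issue { Description = "login button missing" },
};
IDataView data = mlContext.Data.LoadFromEnumerable(samples);

var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Description")
    // Convert string tokens to keys, the input type ProduceNgrams requires.
    .Append(mlContext.Transforms.Conversion.MapValueToKey("TokenKeys", "Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Features", "TokenKeys"));

IDataView transformed = pipeline.Fit(data).Transform(data);

// Print each column's name and ML.NET type.
foreach (var column in transformed.Schema)
{
    Console.WriteLine($"{column.Name}: {column.Type}");
}

// Illustrative input type (not part of ML.NET).
public class Issue
{
    public string Description { get; set; } = "";
}
```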
Usage Examples
High-Level FeaturizeText
using Microsoft.ML;

var mlContext = new MLContext(seed: 42);

// Load data. IssueData is assumed to be a user-defined class whose
// [LoadColumn]-annotated properties include a string Description.
IDataView dataView = mlContext.Data.LoadFromTextFile<IssueData>(
    "issues.csv", separatorChar: ',', hasHeader: true);

// One-step text featurization with defaults
var pipeline = mlContext.Transforms.Text.FeaturizeText(
    outputColumnName: "Features",
    inputColumnName: "Description");

// Fit and transform
ITransformer transformer = pipeline.Fit(dataView);
IDataView transformedData = transformer.Transform(dataView);
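The example relies on an IssueData class and an issues.csv file that are not shown. The following self-contained sketch fills both in with assumptions: the two-column layout (Title, Description), the sample rows, and the FeaturizedIssue class are all hypothetical and should be replaced by the real schema.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Write a tiny stand-in for issues.csv so the sketch runs on its own.
string path = Path.Combine(Path.GetTempPath(), "issues_sample.csv");
File.WriteAllLines(path, new[]
{
    "Title,Description",
    "Crash,App crashes when opening settings",
    "Login,Login page times out after submit",
});

var mlContext = new MLContext(seed: 42);
IDataView dataView = mlContext.Data.LoadFromTextFile<IssueData>(
    path, separatorChar: ',', hasHeader: true);

var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", "Description");
IDataView transformedData = pipeline.Fit(dataView).Transform(dataView);

// Materialize the feature vectors as float arrays.
var rows = mlContext.Data
    .CreateEnumerable<FeaturizedIssue>(transformedData, reuseRowObject: false)
    .ToList();
Console.WriteLine($"{rows.Count} rows, {rows[0].Features.Length} features per row");

// Assumed schema: column 0 is the title, column 1 the description.
public class IssueData
{
    [LoadColumn(0)]
    public string Title { get; set; } = "";

    [LoadColumn(1)]
    public string Description { get; set; } = "";
}

// Receives the Features column when rows are read back.
public class FeaturizedIssue
{
    public float[] Features { get; set; } = Array.Empty<float>();
}
```

The feature count depends on the vocabulary learned during Fit, so it changes with the training data and the FeaturizeText settings.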
Custom Step-by-Step Pipeline
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext(seed: 42);

IDataView dataView = mlContext.Data.LoadFromTextFile<IssueData>(
    "issues.csv", separatorChar: ',', hasHeader: true);

// Build a custom text processing pipeline
var pipeline = mlContext.Transforms.Text.NormalizeText(
        outputColumnName: "NormalizedText",
        inputColumnName: "Description",
        caseMode: TextNormalizingEstimator.CaseMode.Lower,
        keepDiacritics: false,
        keepPunctuations: false,
        keepNumbers: false)
    .Append(mlContext.Transforms.Text.TokenizeIntoWords(
        outputColumnName: "Tokens",
        inputColumnName: "NormalizedText"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords(
        outputColumnName: "FilteredTokens",
        inputColumnName: "Tokens",
        language: StopWordsRemovingEstimator.Language.English))
    // ProduceNgrams expects key-typed tokens, so map each word to a key first
    .Append(mlContext.Transforms.Conversion.MapValueToKey(
        outputColumnName: "TokenKeys",
        inputColumnName: "FilteredTokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams(
        outputColumnName: "Features",
        inputColumnName: "TokenKeys",
        ngramLength: 2,
        useAllLengths: true,
        weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf));

ITransformer transformer = pipeline.Fit(dataView);
IDataView transformedData = transformer.Transform(dataView);
Related Pages
Implements Principle
Requires Environment
- Environment:Dotnet_Machinelearning_Dotnet_SDK_And_Runtime
- Environment:Dotnet_Machinelearning_TorchSharp_Environment