Workflow:Dotnet Machinelearning Text Classification

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, NLP, Text_Classification
Last Updated: 2026-02-09 12:00 GMT

Overview

End-to-end process for classifying text documents into categories using ML.NET's text featurization pipeline and classification trainers.

Description

This workflow describes how to build a text classification model in ML.NET that assigns category labels to text inputs. It covers the full pipeline from raw text to predictions: text normalization (case conversion, optional punctuation removal), tokenization, feature extraction (n-gram counts, TF-IDF weighting, word embeddings), and classifier training. ML.NET offers two approaches: a traditional feature-engineering pipeline that pairs FeaturizeText with classical trainers (SDCA, FastTree), and a deep-learning approach built on TorchSharp that fine-tunes transformer models for text classification and named entity recognition (NER). This workflow focuses on the traditional pipeline, which runs on CPU and does not require a GPU.

Usage

Execute this workflow when you have labeled text data and need to automatically categorize new text into predefined classes. Typical scenarios include sentiment analysis (positive/negative), email spam filtering, customer support ticket routing, intent detection, topic categorization, and document classification.

Execution Steps

Step 1: Load Text Dataset

Load the labeled text dataset into an IDataView. At minimum, the dataset needs a text column (the input document or sentence) and a label column (the target category). For binary classification the label is a Boolean; for multiclass classification it is typically a String that is key-encoded before training.

Key considerations:

  • Text columns should be loaded as DataKind.String
  • For multiclass, use MapValueToKey to convert string labels to key-type before training
  • Large text corpora benefit from streaming rows from disk with a TextLoader (for example via LoadFromTextFile) rather than holding every row in memory for LoadFromEnumerable
  • Inspect sample rows with Preview() to verify text content is loaded correctly
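
The loading idea above can be illustrated outside ML.NET. The following is a minimal Python sketch, not ML.NET code: the column names Label and Text, the LabeledText type, and the two sample rows are all assumptions made up for illustration. It shows rows being yielded one at a time from a tab-separated source, with the label parsed as a Boolean and the text kept as a raw string.

```python
import csv
import io
from typing import Iterator, NamedTuple


class LabeledText(NamedTuple):
    label: bool  # binary label; a multiclass dataset would carry a string here
    text: str    # raw, unprocessed input document


def read_labeled_rows(handle) -> Iterator[LabeledText]:
    """Yield rows one at a time so the whole corpus never has to sit in memory."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield LabeledText(label=row["Label"] == "1", text=row["Text"])


# Hypothetical two-row dataset standing in for a file on disk.
sample = io.StringIO(
    "Label\tText\n"
    "1\tGreat product, works well\n"
    "0\tBroke after one day\n"
)
rows = list(read_labeled_rows(sample))
preview = rows[:2]  # similar in spirit to inspecting a few rows before training
```

Checking a handful of parsed rows this way catches delimiter and column-order mistakes before any training time is spent.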

Step 2: Build Text Featurization Pipeline

Construct the text processing pipeline using ML.NET's text transforms. The FeaturizeText transform is a convenient all-in-one featurizer that combines normalization, tokenization, optional stop word removal, word and character n-gram extraction, and count-based n-gram weighting (TF-IDF can be selected through the featurizer's options) into a single operation. Alternatively, compose individual transforms for finer control.

Key considerations:

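  • FeaturizeText applies text normalization, word tokenization, optional stop word removal, n-gram extraction, and n-gram weighting in sequence; stop word removal and TF-IDF weighting are set through its options rather than assumed
  • For finer control, chain individual transforms: NormalizeText, TokenizeIntoWords, RemoveDefaultStopWords, MapValueToKey, ProduceNgrams (ProduceNgrams expects key-typed tokens, hence the MapValueToKey step)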
  • Word n-grams (unigrams + bigrams) capture local context; character n-grams add robustness to misspellings
  • Stop word removal eliminates common words that add noise (available for 16 languages including English, French, German, Spanish)
  • The output is a dense or sparse float vector suitable for downstream trainers
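
To make the stages concrete, here is a small Python sketch of the same sequence: normalize, tokenize, drop stop words, extract word n-grams, and weight with TF-IDF. It is a conceptual illustration only and does not reproduce ML.NET's implementation; the stop word list, the '|' n-gram separator, and the plain log(N/df) IDF formula are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "of"}  # tiny illustrative list, not ML.NET's


def normalize(text):
    """Lowercase the text and replace punctuation with spaces."""
    return re.sub(r"[^\w\s]", " ", text.lower())


def tokenize(text):
    """Split on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def word_ngrams(tokens, max_n=2):
    """Return all n-grams of length 1..max_n, joined with '|'."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("|".join(tokens[i:i + n]))
    return grams


def featurize(docs, max_n=2):
    """Return (vocabulary, list of sparse {feature index: tf-idf weight} dicts)."""
    bags = [
        Counter(word_ngrams(remove_stop_words(tokenize(normalize(d))), max_n))
        for d in docs
    ]
    vocab = {g: i for i, g in enumerate(sorted({g for bag in bags for g in bag}))}
    doc_freq = Counter(g for bag in bags for g in bag)  # documents containing g
    n_docs = len(docs)
    vectors = [
        {vocab[g]: tf * math.log(n_docs / doc_freq[g]) for g, tf in bag.items()}
        for bag in bags
    ]
    return vocab, vectors


vocab, vectors = featurize(["The movie is great!", "The movie is terrible."])
```

With this IDF formula, an n-gram that occurs in every document (here "movie") receives weight 0, while n-grams unique to one document keep a positive weight. Storing each vector as an index-to-weight dictionary mirrors the sparse representation mentioned above.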

Step 3: Append Classifier and Train

Append a classification trainer to the featurization pipeline and train the model. For binary text classification, use BinaryClassification trainers (SdcaLogisticRegression, FastTree, AveragedPerceptron). For multiclass, use MulticlassClassification trainers (SdcaMaximumEntropy, LightGbm, or OneVersusAll wrapped around a binary trainer). FastTree and LightGbm ship in separate NuGet packages (Microsoft.ML.FastTree and Microsoft.ML.LightGbm).

Key considerations:

  • Linear trainers such as SDCA logistic regression work well with high-dimensional sparse text features
  • LightGBM may require more memory but can capture non-linear patterns in text features
  • For multiclass, ensure MapValueToKey is applied to the label before training and MapKeyToValue after prediction
  • The Fit() call triggers the full pipeline: text featurization followed by model training
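
The label round trip for multiclass training (string label to key before training, key back to string after prediction) can be sketched in a few lines of Python. This is an illustration of the idea rather than ML.NET code; the function names are hypothetical, and the choice of 1-based keys assigned in order of first appearance is an assumption made for the example.

```python
def fit_key_map(labels):
    """Assign each distinct label a 1-based key in order of first appearance."""
    key_of = {}
    for label in labels:
        if label not in key_of:
            key_of[label] = len(key_of) + 1
    return key_of


def invert(key_of):
    """Build the reverse lookup used to turn predicted keys back into labels."""
    return {key: label for label, key in key_of.items()}


# Hypothetical support-ticket categories.
labels = ["billing", "shipping", "billing", "returns"]
key_of = fit_key_map(labels)
encoded = [key_of[label] for label in labels]  # what the trainer would see
label_of = invert(key_of)
decoded = [label_of[key] for key in encoded]   # what the caller wants back
```

The mapping is learned from the training labels only, so a category that never appears in training has no key and cannot be predicted.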

Step 4: Evaluate Classification Quality

Score the test data and compute evaluation metrics. For binary classification: AUC, accuracy, F1 score, precision, recall. For multiclass: macro-accuracy, micro-accuracy, log-loss, per-class precision and recall.

Key considerations:

  • Log-loss penalizes confident but wrong probability estimates; lower values mean the predicted probabilities track the true labels more closely
  • Per-class metrics help identify categories where the model underperforms
  • Confusion matrix reveals systematic misclassification patterns between similar categories
  • Cross-validation provides more robust metric estimates for smaller text datasets
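
The multiclass metrics named above are simple to compute by hand, which helps when interpreting them. The Python sketch below is a reference illustration under stated assumptions, not ML.NET's evaluator: macro-accuracy is taken as the unweighted mean of per-class accuracy, log-loss uses the natural logarithm, and the spam/ham data is invented.

```python
import math
from collections import defaultdict


def micro_accuracy(y_true, y_pred):
    """Fraction of all examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_accuracy(y_true, y_pred):
    """Mean of per-class accuracy, so every class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    return -sum(math.log(max(p[t], eps)) for t, p in zip(y_true, probs)) / len(y_true)


def confusion_matrix(y_true, y_pred, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


y_true = ["spam", "spam", "spam", "ham"]
y_pred = ["spam", "spam", "ham", "ham"]
micro = micro_accuracy(y_true, y_pred)
macro = macro_accuracy(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, ["spam", "ham"])
```

On this imbalanced toy set the micro and macro figures differ because the minority class is weighted equally in the macro average; a large gap between the two on real data usually points to a class the model handles poorly.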

Step 5: Iterate and Improve

Based on evaluation results, iterate on the pipeline to improve performance. Common improvements include adjusting n-gram size, adding character-level features, using word embeddings instead of bag-of-words, adding more training data, or trying different trainers.

Key considerations:

  • Character n-grams (tri-grams, quad-grams) improve robustness to spelling variations and typos
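  • Word embeddings (built-in pre-trained options such as GloVe and fastText via ApplyWordEmbedding, or a custom model file, for example Word2Vec-style vectors) capture semantic similarity between words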
  • Feature selection using mutual information or count-based thresholds can reduce noise from rare features
  • Custom text transforms via CustomMapping allow domain-specific preprocessing (e.g., entity masking, URL normalization)
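
The robustness claim for character n-grams is easy to check on a single misspelling. The Python sketch below is illustrative only: the Jaccard overlap is used here just as a convenient similarity measure, and the word pair is an invented example.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


correct, typo = "shipping", "shiping"

# Word-level features: the two spellings are entirely different tokens.
word_overlap = jaccard({correct}, {typo})

# Character tri-grams: most features are still shared despite the typo.
char_overlap = jaccard(char_ngrams(correct), char_ngrams(typo))
```

Here the word-level overlap is zero, while four of the seven distinct tri-grams are shared, so a classifier with character features still sees most of the evidence carried by the correctly spelled word.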

Step 6: Save and Deploy for Inference

Save the complete pipeline (text transforms + trained classifier) and deploy it for inference. Text preprocessing is embedded in the saved model, so raw text can be fed directly to the prediction engine.

Key considerations:

  • The saved model includes all text transforms, so no external text preprocessing is needed
  • PredictionEngine accepts raw text input and returns the predicted category and confidence scores
  • For web APIs, use PredictionEnginePool for thread-safe concurrent predictions
  • The model accepts text of any length, but accuracy tends to hold up best when inference-time text resembles the training text in length and style
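
The benefit of saving preprocessing together with the classifier can be shown with a toy example. The Python sketch below is not how ML.NET serializes models; BundledModel, its JSON format, and the hand-set weights are hypothetical and exist only to show that a restored model can take raw text directly.

```python
import json
import os
import tempfile


class BundledModel:
    """Toy linear model that carries its own preprocessing and token weights."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def predict(self, raw_text):
        """Preprocess raw text internally, then return (is_positive, score)."""
        tokens = raw_text.lower().split()
        score = self.bias + sum(self.weights.get(t, 0.0) for t in tokens)
        return score > 0, score

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"weights": self.weights, "bias": self.bias}, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return cls(state["weights"], state["bias"])


model = BundledModel({"great": 1.5, "terrible": -2.0})  # hypothetical weights
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    model.save(path)
    restored = BundledModel.load(path)

is_positive, score = restored.predict("Great value")
```

Because lowercasing and tokenization live inside predict, the caller never has to replicate them, which removes a common source of mismatch between training and serving.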
