Workflow:Dotnet Machinelearning Binary Classification Pipeline

Knowledge Sources	ML.NET, ML.NET Docs, ML.NET Cookbook, ML.NET API Reference
Domains	Machine_Learning, Classification, Supervised_Learning
Last Updated	2026-02-09 12:00 GMT

Overview

End-to-end process for training, evaluating, and deploying a binary classification model using ML.NET to predict yes/no outcomes from structured data.

Description

This workflow outlines the standard procedure for building a binary classification model in ML.NET. Binary classification predicts one of two possible outcomes (e.g., spam vs. not spam, fraud vs. legitimate, churn vs. retain). The process covers loading data into the IDataView pipeline, applying feature engineering transforms (categorical encoding, text featurization, normalization), training a classifier (FastTree, LightGBM, SDCA, or others), evaluating model quality with metrics like AUC and accuracy, saving the trained model, and using it for real-time or batch predictions via PredictionEngine.

Usage

Execute this workflow when you have a labeled dataset with a boolean or binary target column and need to build a predictive model that classifies new observations into one of two categories. Typical scenarios include sentiment analysis, fraud detection, disease diagnosis, customer churn prediction, and email spam filtering.

Execution Steps

Step 1: Initialize MLContext

Create an MLContext instance, which serves as the central entry point for all ML.NET operations. MLContext provides access to data loading, transforms, trainers, evaluation methods, and model persistence. Optionally set a random seed for reproducibility.

Key considerations:

MLContext is not thread-safe; create one per logical pipeline
Set a seed value if reproducible results are required across runs
All subsequent operations use this context as a catalog of available methods
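
A minimal sketch of this step, assuming the Microsoft.ML NuGet package is referenced. The seed value 42 and the local variable names are arbitrary choices for illustration, not part of the workflow.

```csharp
using System;
using Microsoft.ML;

// Seed 42 is arbitrary; any fixed integer makes randomized operations
// (for example TrainTestSplit) repeatable across runs.
var mlContext = new MLContext(seed: 42);

// The context does no work itself; it exposes catalogs of operations.
var dataOps = mlContext.Data;                            // loading and splitting
var transforms = mlContext.Transforms;                   // feature engineering
var trainers = mlContext.BinaryClassification.Trainers;  // binary classifiers
var modelOps = mlContext.Model;                          // save, load, prediction engines

Console.WriteLine(dataOps.GetType().Name);
```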

Step 2: Load and Inspect Data

Load the training dataset into an IDataView using TextLoader, LoadFromEnumerable, or a database loader. Define the schema by specifying column names, data types (Boolean for label, String for text, Single for numeric), and their positions in the source file. Optionally split the data into training and test sets.

Key considerations:

IDataView is lazy; data is not physically loaded until consumed by a trainer or enumeration
Define the label column as Boolean (DataKind.Boolean) for binary classification
Use TrainTestSplit to hold out a portion of data for evaluation (e.g., 20% test fraction)
Inspect data with Preview() during development to verify schema correctness
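
The loading step might look like the sketch below. The ReviewRow class, its column layout, and the temporary file name are hypothetical; the file is generated in place only so that the example runs on its own.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 0);

// Write a small tab-separated file so the sketch is runnable end to end.
string path = Path.Combine(Path.GetTempPath(), "reviews-sample.tsv");
var lines = new[] { "Label\tText\tWordCount" }
    .Concat(Enumerable.Range(0, 50).Select(i =>
        i % 2 == 0 ? $"1\tgreat product {i}\t3" : $"0\tterrible product {i}\t3"));
File.WriteAllLines(path, lines);

// The schema (names, types, positions) comes from the LoadColumn
// attributes on ReviewRow. Nothing is read from disk yet.
IDataView data = mlContext.Data.LoadFromTextFile<ReviewRow>(
    path, separatorChar: '\t', hasHeader: true);

// Hold out 20% of the rows for evaluation.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

// Preview materializes a few rows; intended for development-time checks.
var preview = split.TrainSet.Preview(maxRows: 5);
foreach (var column in preview.Schema)
    Console.WriteLine($"{column.Name}: {column.Type}");

public class ReviewRow
{
    [LoadColumn(0)] public bool Label { get; set; }
    [LoadColumn(1)] public string Text { get; set; } = string.Empty;
    [LoadColumn(2)] public float WordCount { get; set; }
}
```

For data already in memory, LoadFromEnumerable accepts a collection of such objects and infers the schema from the class properties.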

Step 3: Build Feature Engineering Pipeline

Construct a transformation pipeline that converts raw columns into a numeric feature vector suitable for the trainer. This typically involves encoding categorical string columns via one-hot encoding, featurizing text columns into TF-IDF or n-gram vectors, normalizing numeric columns, and concatenating all transformed columns into a single "Features" vector column.

Key considerations:

OneHotEncoding converts string columns to indicator vectors
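FeaturizeText handles text normalization, tokenization, and n-gram extraction; stop word removal can be enabled through its options
NormalizeMinMax or NormalizeMeanVariance scales numeric features, which matters most for linear trainers such as SDCA and averaged perceptron; tree-based trainers are largely insensitive to feature scale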
Concatenate merges multiple feature columns into the single "Features" column expected by trainers
Hash-based encoding handles high-cardinality categorical columns efficiently
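
The transforms listed above can be chained as in the following sketch. The CustomerRow class, its column names, and the choice of 8 hash bits are assumptions made for illustration; real column names come from your own schema.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Hypothetical in-memory rows standing in for a real dataset.
var rows = Enumerable.Range(0, 40).Select(i => new CustomerRow
{
    Plan = i % 3 == 0 ? "basic" : "premium",
    City = $"city-{i % 20}",
    Comment = i % 2 == 0 ? "support was slow and unhelpful" : "very happy with the service",
    MonthlySpend = 10f + i
}).ToList();
IDataView data = mlContext.Data.LoadFromEnumerable(rows);

var featurePipeline = mlContext.Transforms.Categorical
    // Low-cardinality string column -> indicator vector.
    .OneHotEncoding("PlanEncoded", "Plan")
    // Higher-cardinality string column -> hashed indicator vector (2^8 slots).
    .Append(mlContext.Transforms.Categorical.OneHotHashEncoding(
        "CityHashed", "City", numberOfBits: 8))
    // Free text -> numeric n-gram vector.
    .Append(mlContext.Transforms.Text.FeaturizeText("CommentFeatures", "Comment"))
    // Numeric column -> rescaled by its observed min and max.
    .Append(mlContext.Transforms.NormalizeMinMax("SpendNormalized", "MonthlySpend"))
    // All pieces -> the single "Features" vector that trainers read.
    .Append(mlContext.Transforms.Concatenate(
        "Features", "PlanEncoded", "CityHashed", "CommentFeatures", "SpendNormalized"));

// Fitting learns dictionaries and scaling bounds; Transform applies them.
IDataView transformed = featurePipeline.Fit(data).Transform(data);
Console.WriteLine(transformed.Schema["Features"].Type);

public class CustomerRow
{
    public string Plan { get; set; } = string.Empty;
    public string City { get; set; } = string.Empty;
    public string Comment { get; set; } = string.Empty;
    public float MonthlySpend { get; set; }
}
```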

Step 4: Append Trainer and Train Model

Append a binary classification trainer to the pipeline and call Fit() on the training data to produce a trained model (ITransformer). ML.NET offers multiple binary classification algorithms including FastTree (gradient boosted trees), LightGBM, SDCA (stochastic dual coordinate ascent), and averaged perceptron.

Key considerations:

FastTree and LightGBM are tree-based ensemble methods that handle non-linear relationships well
SDCA and logistic regression are linear methods suitable for high-dimensional sparse features
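Every binary trainer outputs PredictedLabel (bool) and Score (float); calibrated trainers such as FastTree, LightGBM, and SDCA logistic regression also output Probability (float)
Uncalibrated trainers such as averaged perceptron emit only a raw score; a calibrator (e.g., Platt scaling) can be appended to convert it into a well-calibrated probability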
The Fit() call triggers the entire pipeline (transforms + training) on the training data
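
A sketch of the training step follows, using SdcaLogisticRegression because it ships in the core Microsoft.ML package (FastTree and LightGBM live in the separate Microsoft.ML.FastTree and Microsoft.ML.LightGbm packages). The Observation class and the synthetic rule that generates labels are hypothetical.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Synthetic, hypothetical data: the label is true when X1 + X2 > 1.
// ToList() fixes the rows so repeated passes see identical data.
var random = new Random(0);
var rows = Enumerable.Range(0, 500).Select(_ =>
{
    float x1 = (float)random.NextDouble();
    float x2 = (float)random.NextDouble();
    return new Observation { X1 = x1, X2 = x2, Label = x1 + x2 > 1f };
}).ToList();
IDataView trainingData = mlContext.Data.LoadFromEnumerable(rows);

// Feature transform followed by the trainer, as one estimator chain.
var pipeline = mlContext.Transforms.Concatenate("Features", "X1", "X2")
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
        labelColumnName: "Label", featureColumnName: "Features"));

// Fit runs the transforms and the trainer over the training data.
ITransformer model = pipeline.Fit(trainingData);

// Scoring appends the prediction columns to the input columns.
IDataView scored = model.Transform(trainingData);
foreach (var column in scored.Schema)
    Console.WriteLine(column.Name);

public class Observation
{
    public float X1 { get; set; }
    public float X2 { get; set; }
    public bool Label { get; set; }
}
```

Swapping in another algorithm only changes the trainer passed to the last Append call; the rest of the chain stays the same.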

Step 5: Evaluate Model Quality

Apply the trained model to the held-out test data and compute evaluation metrics. For binary classification, key metrics include AUC (Area Under the ROC Curve), accuracy, F1 score, positive precision, and positive recall. Use these metrics to determine whether the model meets quality requirements.

Key considerations:

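AUC measures how well the model ranks positives above negatives, independent of the decision threshold; 0.5 is chance level, and values above 0.9 generally indicate strong discrimination
Accuracy alone can be misleading with imbalanced classes; when 95% of rows are negative, a model that always predicts negative reaches 95% accuracy while detecting no positives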
Use cross-validation (CrossValidate with N folds) for more robust metric estimates on small datasets
Permutation Feature Importance reveals which features contribute most to predictions
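
Evaluation on a held-out split and with cross-validation could be sketched as follows. The Observation class and the synthetic data are hypothetical, and the trainer is again SdcaLogisticRegression only because it needs no extra package.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Synthetic, hypothetical data: the label is true when X1 + X2 > 1.
var random = new Random(0);
var rows = Enumerable.Range(0, 500).Select(_ =>
{
    float x1 = (float)random.NextDouble();
    float x2 = (float)random.NextDouble();
    return new Observation { X1 = x1, X2 = x2, Label = x1 + x2 > 1f };
}).ToList();
IDataView data = mlContext.Data.LoadFromEnumerable(rows);
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

var pipeline = mlContext.Transforms.Concatenate("Features", "X1", "X2")
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression());

// Fit on the training portion only, then score the held-out rows.
var model = pipeline.Fit(split.TrainSet);
IDataView predictions = model.Transform(split.TestSet);

var metrics = mlContext.BinaryClassification.Evaluate(predictions, labelColumnName: "Label");
Console.WriteLine($"AUC:       {metrics.AreaUnderRocCurve:F3}");
Console.WriteLine($"Accuracy:  {metrics.Accuracy:F3}");
Console.WriteLine($"F1:        {metrics.F1Score:F3}");
Console.WriteLine($"Precision: {metrics.PositivePrecision:F3}");
Console.WriteLine($"Recall:    {metrics.PositiveRecall:F3}");

// Cross-validation refits the whole pipeline once per fold.
var folds = mlContext.BinaryClassification.CrossValidate(data, pipeline, numberOfFolds: 5);
double meanAuc = folds.Average(fold => fold.Metrics.AreaUnderRocCurve);
Console.WriteLine($"Mean AUC over {folds.Count} folds: {meanAuc:F3}");

public class Observation
{
    public float X1 { get; set; }
    public float X2 { get; set; }
    public bool Label { get; set; }
}
```

Evaluate as called here expects a Probability column; for an uncalibrated model, EvaluateNonCalibrated is the corresponding call.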

Step 6: Save Trained Model

Persist the trained model to disk using MLContext.Model.Save(), which serializes the complete pipeline (transforms + trainer) into a .zip file. The saved model includes the data schema and can be loaded in a different process or machine.

Key considerations:

The saved model includes the full pipeline, so no separate preprocessing code is needed at inference time
Models are saved with schema information for validation during loading
Model files are portable across platforms (Windows, Linux, macOS)
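
Saving and reloading might look like this sketch. The file name under the temp directory, the Observation class, and the synthetic data are assumptions for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Train a small model on synthetic, hypothetical data so there is something to save.
var random = new Random(0);
var rows = Enumerable.Range(0, 200).Select(_ =>
{
    float x1 = (float)random.NextDouble();
    float x2 = (float)random.NextDouble();
    return new Observation { X1 = x1, X2 = x2, Label = x1 + x2 > 1f };
}).ToList();
IDataView data = mlContext.Data.LoadFromEnumerable(rows);
ITransformer model = mlContext.Transforms.Concatenate("Features", "X1", "X2")
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression())
    .Fit(data);

// Save the transforms, the trained predictor, and the input schema into one .zip.
string modelPath = Path.Combine(Path.GetTempPath(), "binary-classifier.zip");
mlContext.Model.Save(model, data.Schema, modelPath);

// A fresh context stands in for a different process or machine.
var loadContext = new MLContext();
ITransformer loaded = loadContext.Model.Load(modelPath, out var inputSchema);

// The stored schema lists the columns the model expects as input.
foreach (var column in inputSchema)
    Console.WriteLine($"{column.Name}: {column.Type}");

public class Observation
{
    public float X1 { get; set; }
    public float X2 { get; set; }
    public bool Label { get; set; }
}
```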

Step 7: Deploy for Prediction

Load the saved model and create a PredictionEngine for single-row real-time inference, or use Transform() on an IDataView for batch scoring. Define strongly-typed input and output classes that match the model schema.

Key considerations:

PredictionEngine is optimized for single-row prediction but is not thread-safe
For web applications, use PredictionEnginePool from Microsoft.Extensions.ML for thread-safe pooling
Batch prediction via Transform() is more efficient for scoring large datasets
Output includes the PredictedLabel and Score fields, plus Probability when the model is calibrated
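
Both prediction modes are sketched below under the same assumptions as before: Observation and Prediction are hypothetical classes, and the model is trained inline on synthetic data instead of being loaded from disk so that the example is self-contained.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Train a small model on synthetic, hypothetical data.
var random = new Random(0);
var rows = Enumerable.Range(0, 500).Select(_ =>
{
    float x1 = (float)random.NextDouble();
    float x2 = (float)random.NextDouble();
    return new Observation { X1 = x1, X2 = x2, Label = x1 + x2 > 1f };
}).ToList();
IDataView data = mlContext.Data.LoadFromEnumerable(rows);
ITransformer model = mlContext.Transforms.Concatenate("Features", "X1", "X2")
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression())
    .Fit(data);

// Single-row inference. The engine is not thread-safe: keep one per thread.
var engine = mlContext.Model.CreatePredictionEngine<Observation, Prediction>(model);
Prediction single = engine.Predict(new Observation { X1 = 0.9f, X2 = 0.8f });
Console.WriteLine($"{single.PredictedLabel} (p = {single.Probability:F3})");

// Batch scoring: transform a whole IDataView, then read rows back as objects.
var batch = mlContext.Data.LoadFromEnumerable(new[]
{
    new Observation { X1 = 0.1f, X2 = 0.2f },
    new Observation { X1 = 0.7f, X2 = 0.9f }
});
var results = mlContext.Data
    .CreateEnumerable<Prediction>(model.Transform(batch), reuseRowObject: false)
    .ToList();
foreach (var result in results)
    Console.WriteLine($"{result.PredictedLabel} (score = {result.Score:F3})");

public class Observation
{
    public float X1 { get; set; }
    public float X2 { get; set; }
    public bool Label { get; set; }
}

// Property names match the output column names produced by the trainer.
public class Prediction
{
    public bool PredictedLabel { get; set; }
    public float Score { get; set; }
    public float Probability { get; set; }
}
```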
