Workflow:Dotnet Machinelearning Binary Classification Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Classification, Supervised_Learning |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for training, evaluating, and deploying a binary classification model using ML.NET to predict yes/no outcomes from structured data.
Description
This workflow outlines the standard procedure for building a binary classification model in ML.NET. Binary classification predicts one of two possible outcomes (e.g., spam vs. not spam, fraud vs. legitimate, churn vs. retain). The process covers loading data into an IDataView, applying feature engineering transforms (categorical encoding, text featurization, normalization), training a classifier (FastTree, LightGBM, SDCA, or others), evaluating model quality with metrics such as AUC and accuracy, saving the trained model, and using it for real-time predictions via PredictionEngine or batch predictions via Transform().
Usage
Execute this workflow when you have a labeled dataset with a boolean or binary target column and need to build a predictive model that classifies new observations into one of two categories. Typical scenarios include sentiment analysis, fraud detection, disease diagnosis, customer churn prediction, and email spam filtering.
Execution Steps
Step 1: Initialize MLContext
Create an MLContext instance, which serves as the central entry point for all ML.NET operations. MLContext provides access to data loading, transforms, trainers, evaluation methods, and model persistence. Optionally set a random seed for reproducibility.
Key considerations:
- MLContext is not thread-safe; create one per logical pipeline
- Set a seed value if reproducible results are required across runs
- All subsequent operations use this context as a catalog of available methods
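A minimal sketch of this step, assuming the Microsoft.ML NuGet package is referenced and C# top-level statements are used. The seed value 0 is an arbitrary choice for illustration.

```csharp
using System;
using Microsoft.ML;

// A fixed seed makes randomized steps (data splits, some trainers) repeatable.
var mlContext = new MLContext(seed: 0);

// The context exposes one catalog per task area.
var dataCatalog = mlContext.Data;                              // loading and splitting
var transformCatalog = mlContext.Transforms;                   // feature engineering
var trainerCatalog = mlContext.BinaryClassification.Trainers;  // binary classification trainers
var modelCatalog = mlContext.Model;                            // save, load, prediction engines

Console.WriteLine(trainerCatalog != null);
```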
Step 2: Load and Inspect Data
Load the training dataset into an IDataView using TextLoader, LoadFromEnumerable, or a database loader. Define the schema by specifying column names, data types (Boolean for label, String for text, Single for numeric), and their positions in the source file. Optionally split the data into training and test sets.
Key considerations:
- IDataView is lazy; data is not physically loaded until consumed by a trainer or enumeration
- Define the label column as Boolean (DataKind.Boolean) for binary classification
- Use TrainTestSplit to hold out a portion of data for evaluation (e.g., 20% test fraction)
- Inspect data with Preview() during development to verify schema correctness
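The loading step could look roughly like the sketch below. The file name, the column layout, and the `ReviewRow` class are hypothetical; the sample file is written to a temp path only so the example runs on its own.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 0);

// Hypothetical tab-separated sample: label, free text, one numeric column.
string path = Path.Combine(Path.GetTempPath(), "reviews-sample.tsv");
File.WriteAllLines(path, new[]
{
    "Label\tText\tLength",
    "1\tgreat product\t13",
    "0\tbroke after a day\t17",
    "1\tworks as described\t18",
    "0\twould not buy again\t19",
    "1\tvery happy\t10",
});

// The schema comes from the [LoadColumn] attributes on ReviewRow.
IDataView data = mlContext.Data.LoadFromTextFile<ReviewRow>(
    path, separatorChar: '\t', hasHeader: true);

// Hold out 20% of the rows for evaluation.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

// Preview materializes a few rows; intended for development-time checks only.
var preview = data.Preview(maxRows: 3);
foreach (var column in preview.Schema)
    Console.WriteLine($"{column.Name}: {column.Type}");

public class ReviewRow
{
    [LoadColumn(0)] public bool Label;
    [LoadColumn(1)] public string Text;
    [LoadColumn(2)] public float Length;
}
```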
Step 3: Build Feature Engineering Pipeline
Construct a transformation pipeline that converts raw columns into a numeric feature vector suitable for the trainer. This typically involves encoding categorical string columns via one-hot encoding, featurizing text columns into TF-IDF or n-gram vectors, normalizing numeric columns, and concatenating all transformed columns into a single "Features" vector column.
Key considerations:
- OneHotEncoding converts string columns to indicator vectors
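- FeaturizeText handles text normalization, tokenization, and n-gram extraction; stop word removal is available as an option
- NormalizeMinMax or NormalizeMeanVariance scales numeric features, which matters most for linear, gradient-based trainers such as SDCA; tree-based trainers are largely insensitive to feature scale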
- Concatenate merges multiple feature columns into the single "Features" column expected by trainers
- Hash-based encoding (OneHotHashEncoding) handles high-cardinality categorical columns efficiently
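The transforms listed above can be chained as in the following sketch. The `CustomerRow` class, its column names, and the sample values are assumptions made for illustration; the intermediate column names are arbitrary.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 0);

// Hypothetical in-memory rows with one categorical, one text, and one numeric column.
var rows = new[]
{
    new CustomerRow { Plan = "basic",   Comment = "too expensive for me", MonthlySpend = 10f, Label = true },
    new CustomerRow { Plan = "premium", Comment = "love the service",     MonthlySpend = 45f, Label = false },
    new CustomerRow { Plan = "basic",   Comment = "support was slow",     MonthlySpend = 12f, Label = true },
    new CustomerRow { Plan = "premium", Comment = "works well",           MonthlySpend = 50f, Label = false },
};
IDataView data = mlContext.Data.LoadFromEnumerable(rows);

var featurePipeline = mlContext.Transforms.Categorical.OneHotEncoding("PlanEncoded", "Plan")
    .Append(mlContext.Transforms.Text.FeaturizeText("CommentFeatures", "Comment"))
    .Append(mlContext.Transforms.NormalizeMinMax("MonthlySpendNorm", "MonthlySpend"))
    // Trainers read a single vector column, named "Features" by default.
    .Append(mlContext.Transforms.Concatenate(
        "Features", "PlanEncoded", "CommentFeatures", "MonthlySpendNorm"));

// Fit learns the dictionaries and ranges; Transform applies them lazily.
IDataView transformed = featurePipeline.Fit(data).Transform(data);
Console.WriteLine(transformed.Schema["Features"].Type);

public class CustomerRow
{
    public string Plan { get; set; }
    public string Comment { get; set; }
    public float MonthlySpend { get; set; }
    public bool Label { get; set; }
}
```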
Step 4: Append Trainer and Train Model
Append a binary classification trainer to the pipeline and call Fit() on the training data to produce a trained model (ITransformer). ML.NET offers multiple binary classification algorithms including FastTree (gradient boosted trees), LightGBM, SDCA (stochastic dual coordinate ascent), and averaged perceptron.
Key considerations:
- FastTree and LightGBM are tree-based ensemble methods that handle non-linear relationships well; they ship in the separate Microsoft.ML.FastTree and Microsoft.ML.LightGbm packages
- SDCA and logistic regression are linear methods suitable for high-dimensional sparse features
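- The trainer outputs PredictedLabel (bool) and Score (float) columns; calibrated trainers also output a Probability (float) column
- For trainers that are not calibrated (e.g., averaged perceptron), a calibrator can be applied to convert raw scores into well-calibrated probabilities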
- The Fit() call triggers the entire pipeline (transforms + training) on the training data
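A sketch of the training step, using SdcaLogisticRegression because it ships in the core Microsoft.ML package. The `Sample` class and the synthetic data (label is true when A + B exceeds 1) are assumptions made so the example runs on its own.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Synthetic, hypothetical data: two numeric features and a boolean label.
var random = new Random(0);
var rows = Enumerable.Range(0, 500).Select(_ =>
{
    float a = (float)random.NextDouble();
    float b = (float)random.NextDouble();
    return new Sample { A = a, B = b, Label = a + b > 1f };
}).ToList();
IDataView trainingData = mlContext.Data.LoadFromEnumerable(rows);

// Feature assembly followed by a calibrated linear trainer.
var pipeline = mlContext.Transforms.Concatenate("Features", nameof(Sample.A), nameof(Sample.B))
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
        labelColumnName: "Label", featureColumnName: "Features"));

// Fit runs the transforms and the trainer over the training data.
ITransformer model = pipeline.Fit(trainingData);

// The fitted model adds the prediction columns to its output schema.
IDataView scored = model.Transform(trainingData);
foreach (var name in new[] { "PredictedLabel", "Score", "Probability" })
    Console.WriteLine($"{name}: {scored.Schema[name].Type}");

public class Sample
{
    public float A { get; set; }
    public float B { get; set; }
    public bool Label { get; set; }
}
```

Swapping in FastTree or LightGBM would follow the same shape, with the trainer call replaced and the corresponding package referenced.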
Step 5: Evaluate Model Quality
Apply the trained model to the held-out test data and compute evaluation metrics. For binary classification, key metrics include AUC (Area Under the ROC Curve), accuracy, F1 score, positive precision, and positive recall. Use these metrics to determine whether the model meets quality requirements.
Key considerations:
- AUC is the primary ranking metric: 0.5 is no better than random guessing and 1.0 is a perfect ranking; values above 0.9 generally indicate strong discrimination
- Accuracy alone can be misleading with imbalanced classes
- Use cross-validation (CrossValidate with N folds) for more robust metric estimates on small datasets
- Permutation Feature Importance reveals which features contribute most to predictions
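Evaluation on a held-out split and with cross-validation might be sketched as follows. The `Sample` class and synthetic data are hypothetical, and Permutation Feature Importance is not shown here.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Synthetic, hypothetical data: the label is true when A + B exceeds 1.
var random = new Random(0);
var rows = Enumerable.Range(0, 500).Select(_ =>
{
    float a = (float)random.NextDouble();
    float b = (float)random.NextDouble();
    return new Sample { A = a, B = b, Label = a + b > 1f };
}).ToList();
IDataView data = mlContext.Data.LoadFromEnumerable(rows);

var pipeline = mlContext.Transforms.Concatenate("Features", nameof(Sample.A), nameof(Sample.B))
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression());

// Train on 80% of the rows, score the remaining 20%.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
ITransformer model = pipeline.Fit(split.TrainSet);
var metrics = mlContext.BinaryClassification.Evaluate(
    model.Transform(split.TestSet), labelColumnName: "Label");

Console.WriteLine($"AUC:       {metrics.AreaUnderRocCurve:F3}");
Console.WriteLine($"Accuracy:  {metrics.Accuracy:F3}");
Console.WriteLine($"F1:        {metrics.F1Score:F3}");
Console.WriteLine($"Precision: {metrics.PositivePrecision:F3}");
Console.WriteLine($"Recall:    {metrics.PositiveRecall:F3}");

// Cross-validation refits the pipeline once per fold and reports metrics per fold.
var folds = mlContext.BinaryClassification.CrossValidate(
    data, pipeline, numberOfFolds: 5, labelColumnName: "Label");
double meanAuc = folds.Average(fold => fold.Metrics.AreaUnderRocCurve);
Console.WriteLine($"Mean AUC over {folds.Count} folds: {meanAuc:F3}");

public class Sample
{
    public float A { get; set; }
    public float B { get; set; }
    public bool Label { get; set; }
}
```

Evaluate expects a Probability column; for a trainer without calibration, EvaluateNonCalibrated is the corresponding call.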
Step 6: Save Trained Model
Persist the trained model to disk using MLContext.Model.Save(), which serializes the complete fitted pipeline (transforms + trained model) into a .zip file. The saved model includes the input data schema and can be loaded in a different process or on a different machine.
Key considerations:
- The saved model includes the full pipeline, so no separate preprocessing code is needed at inference time
- Models are saved with schema information for validation during loading
- Model files are portable across platforms (Windows, Linux, macOS)
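A sketch of saving and reloading, assuming a small model trained inline on hypothetical synthetic data. The file name `binary-model.zip` and the temp directory are illustrative choices.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Hypothetical training data, only so there is a fitted model to save.
var random = new Random(0);
var rows = Enumerable.Range(0, 200).Select(_ =>
{
    float a = (float)random.NextDouble();
    float b = (float)random.NextDouble();
    return new Sample { A = a, B = b, Label = a + b > 1f };
}).ToList();
IDataView trainingData = mlContext.Data.LoadFromEnumerable(rows);

ITransformer model = mlContext.Transforms
    .Concatenate("Features", nameof(Sample.A), nameof(Sample.B))
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression())
    .Fit(trainingData);

// Save the fitted pipeline together with the schema of its input data.
string modelPath = Path.Combine(Path.GetTempPath(), "binary-model.zip");
mlContext.Model.Save(model, trainingData.Schema, modelPath);

// Load returns the model and hands back the stored input schema.
ITransformer loaded = mlContext.Model.Load(modelPath, out DataViewSchema inputSchema);
Console.WriteLine(string.Join(", ", inputSchema.Select(column => column.Name)));

public class Sample
{
    public float A { get; set; }
    public float B { get; set; }
    public bool Label { get; set; }
}
```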
Step 7: Deploy for Prediction
Load the saved model and create a PredictionEngine for single-row real-time inference, or use Transform() on an IDataView for batch scoring. Define strongly-typed input and output classes that match the model schema.
Key considerations:
- PredictionEngine is optimized for single-row prediction but is not thread-safe
- For web applications, use PredictionEnginePool from Microsoft.Extensions.ML for thread-safe pooling
- Batch prediction via Transform() is more efficient for scoring large datasets
- Output includes PredictedLabel and Score fields, plus Probability when the model is calibrated
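Both prediction modes can be sketched as below. The model is trained inline on hypothetical synthetic data to keep the example self-contained; in a deployment it would come from MLContext.Model.Load(). The `Sample` and `Prediction` classes are assumptions for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 0);

// Stand-in for a loaded model: train a small one on synthetic data.
var random = new Random(0);
var rows = Enumerable.Range(0, 500).Select(_ =>
{
    float a = (float)random.NextDouble();
    float b = (float)random.NextDouble();
    return new Sample { A = a, B = b, Label = a + b > 1f };
}).ToList();
ITransformer model = mlContext.Transforms
    .Concatenate("Features", nameof(Sample.A), nameof(Sample.B))
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression())
    .Fit(mlContext.Data.LoadFromEnumerable(rows));

// Single-row inference. A PredictionEngine must not be shared across threads.
var engine = mlContext.Model.CreatePredictionEngine<Sample, Prediction>(model);
Prediction single = engine.Predict(new Sample { A = 0.9f, B = 0.8f });
Console.WriteLine($"{single.PredictedLabel} (p = {single.Probability:F3})");

// Batch scoring: transform a whole IDataView, then read the rows back.
IDataView batch = mlContext.Data.LoadFromEnumerable(new[]
{
    new Sample { A = 0.1f, B = 0.2f },
    new Sample { A = 0.7f, B = 0.9f },
});
var results = mlContext.Data
    .CreateEnumerable<Prediction>(model.Transform(batch), reuseRowObject: false)
    .ToList();
foreach (var result in results)
    Console.WriteLine($"{result.PredictedLabel} (score = {result.Score:F3})");

public class Sample
{
    public float A { get; set; }
    public float B { get; set; }
    public bool Label { get; set; }
}

public class Prediction
{
    [ColumnName("PredictedLabel")] public bool PredictedLabel { get; set; }
    public float Score { get; set; }
    public float Probability { get; set; }
}
```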