Principle: Dotnet Machine Learning Sweepable Pipeline Definition
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, AutoML |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
AutoML pipeline structure search uses a symbolic algebra to define the space of possible ML pipelines, enabling automated discovery of both the best algorithm and its optimal hyperparameter configuration.
Description
A sweepable pipeline represents not a single fixed ML pipeline, but an entire search space of pipelines. Traditional ML workflows require the practitioner to manually select a specific algorithm and its hyperparameters. Sweepable pipelines instead encode a declarative specification of all candidate pipelines that an AutoML engine should explore.
The search space is defined through two algebraic operators:
- `+` (OneOf): Represents alternative estimators at a given pipeline stage. For example, `FastTree + LightGBM + LogisticRegression` means the AutoML engine should try each of these trainers and determine which performs best.
- `*` (Concatenate): Represents sequential pipeline steps. For example, `Featurizer * Trainer` means data flows through the featurizer first, then into the trainer.
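The two operators above can be modeled with ordinary operator overloading. The following Python sketch is illustrative only, not the ML.NET API: the class names `Estimator`, `OneOf`, and `Concat` are hypothetical stand-ins, and `enumerate()` expands the symbolic expression into every concrete pipeline it denotes.

```python
# Conceptual model of the +/* pipeline algebra:
#   +  (OneOf)  collects alternative estimators at one stage
#   *  (Concat) chains stages sequentially
from itertools import product


class Node:
    def __add__(self, other):      # OneOf: alternative estimators
        return OneOf(self, other)

    def __mul__(self, other):      # Concatenate: sequential stages
        return Concat(self, other)


class Estimator(Node):
    def __init__(self, name):
        self.name = name

    def enumerate(self):
        yield (self.name,)


class OneOf(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def enumerate(self):           # union of the candidates on each side
        yield from self.left.enumerate()
        yield from self.right.enumerate()


class Concat(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def enumerate(self):           # Cartesian product of the stage choices
        for a, b in product(self.left.enumerate(), self.right.enumerate()):
            yield a + b


# Featurizer * (FastTree + LightGBM + LogisticRegression)
space = Estimator("Featurizer") * (
    Estimator("FastTree") + Estimator("LightGBM") + Estimator("LogisticRegression")
)
print(list(space.enumerate()))  # three concrete pipelines, one per trainer
```

Note how `Concat.enumerate` takes a Cartesian product: if the featurizer stage itself offered alternatives, every featurizer choice would be paired with every trainer choice.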
Each estimator within the pipeline carries an associated SearchSpace that defines the ranges and distributions of its tunable hyperparameters (learning rate, number of leaves, regularization weight, etc.). The combination of structural alternatives (which algorithms) and parametric ranges (which hyperparameter values) creates a rich combinatorial space.
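The structural/parametric split can be sketched as follows. This is a hedged illustration, not ML.NET's `SearchSpace` API: the `SEARCH_SPACES` table, the hyperparameter names, and the `sample` helper are all hypothetical, chosen only to show one structural draw (which trainer) combined with one parametric draw (which configuration).

```python
# Sampling one point from the combined (algorithm, hyperparameter) space.
import random

# Per-trainer tunable ranges: (low, high) tuples are continuous ranges,
# lists are discrete choices. Values here are illustrative.
SEARCH_SPACES = {
    "FastTree": {"learning_rate": (0.01, 0.3), "num_leaves": [16, 32, 64, 128]},
    "LightGBM": {"learning_rate": (0.005, 0.2), "num_leaves": [31, 63, 127]},
}


def sample(trainer, rng):
    """Draw one hyperparameter configuration for the chosen trainer."""
    config = {}
    for name, spec in SEARCH_SPACES[trainer].items():
        if isinstance(spec, tuple):           # continuous range
            config[name] = rng.uniform(*spec)
        else:                                 # discrete choices
            config[name] = rng.choice(spec)
    return config


rng = random.Random(0)
trainer = rng.choice(sorted(SEARCH_SPACES))   # structural choice (OneOf)
params = sample(trainer, rng)                 # parametric choice (SearchSpace)
print(trainer, params)
```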
This algebraic approach enables compositional pipeline construction. A featurizer pipeline that handles text, numeric, and categorical columns can be combined with a trainer pipeline that offers multiple classification algorithms, and the AutoML engine will search across the full Cartesian product of structural and parametric choices.
Usage
Use sweepable pipeline definitions when you want an AutoML system to search over both the choice of algorithm and the hyperparameter configuration. This is appropriate when you do not have strong prior knowledge about which algorithm will perform best on your data, or when you want to systematically benchmark multiple approaches. For production scenarios where the algorithm is already known, a fixed (non-sweepable) pipeline is more efficient.
Theoretical Basis
The sweepable pipeline formalism maps onto Combined Algorithm Selection and Hyperparameter optimization (CASH), introduced by Thornton et al. (2013). The CASH problem defines:
Given: a set of algorithms A = {A_1, ..., A_k}
each A_i with hyperparameter space Lambda_i
a dataset D and metric m
Find: A* in A, lambda* in Lambda_{A*}
such that m(A*(lambda*, D_train), D_val) is optimal
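The Find/such-that clauses above can be restated in standard argmin notation; here \(\mathcal{L}\) denotes the validation loss induced by the metric m (a restatement, with notation chosen for this sketch):

```latex
% CASH objective: jointly select the algorithm and its hyperparameters.
(A^{*}, \lambda^{*}) \in
  \operatorname*{arg\,min}_{A_i \in \mathcal{A},\; \lambda \in \Lambda_i}
  \mathcal{L}\bigl(A_i(\lambda;\, D_{\mathrm{train}}),\; D_{\mathrm{val}}\bigr)
```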
The algebraic operators map directly to this formalism:
OneOf(A_1, A_2, ..., A_k) = algorithm selection (choose A*)
Concat(Step_1, Step_2, ...) = pipeline composition (sequential stages)
SearchSpace(A_i) = hyperparameter space Lambda_i
The pipeline search space is the union of all (algorithm, hyperparameter) combinations:
S = Union over i of {A_i} x Lambda_i
A tuner (e.g., Bayesian optimization, random search) samples from S, trains a model for each sample, evaluates on a validation set, and iterates. The sweepable pipeline provides the structured definition of S that the tuner navigates.
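The tuner loop described above can be sketched as a minimal random search over S. Everything here is a toy stand-in, not an ML.NET call: the `OBJECTIVES` table replaces train-then-evaluate with a known-optimum function of the hyperparameter, so the loop's sample/evaluate/keep-best structure is visible without real training.

```python
# Random-search tuner over S = union_i {A_i} x Lambda_i.
import random

# Toy "validation metric" per algorithm: higher is better, peaking at a
# known learning rate. Stands in for training and evaluating a model.
OBJECTIVES = {
    "FastTree": lambda lr: -(lr - 0.10) ** 2,   # best near lr = 0.10
    "LightGBM": lambda lr: -(lr - 0.05) ** 2,   # best near lr = 0.05
}
RANGES = {"FastTree": (0.01, 0.3), "LightGBM": (0.005, 0.2)}


def random_search(budget, seed=0):
    rng = random.Random(seed)
    best = (float("-inf"), None, None)          # (score, algorithm, lr)
    for _ in range(budget):
        algo = rng.choice(sorted(OBJECTIVES))   # sample structural choice
        lr = rng.uniform(*RANGES[algo])         # sample hyperparameter
        score = OBJECTIVES[algo](lr)            # "evaluate on validation set"
        if score > best[0]:
            best = (score, algo, lr)
    return best


score, algo, lr = random_search(budget=200)
print(algo, round(lr, 3))
```

A Bayesian tuner would replace the uniform draws with a model-guided proposal, but it navigates the same structured S that the sweepable pipeline defines.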