Principle:Scikit learn Scikit learn Column Composition
Overview
A structural pattern that applies different transformation pipelines to different subsets of features and combines the results.
Description
Most real-world datasets contain a mix of feature types: numeric columns that need scaling, categorical columns that need encoding, text columns that need vectorization, and datetime columns whose components need to be extracted. Applying a single transformer to the entire dataset is either impossible (a scaler cannot process strings) or incorrect (encoding numeric features as categories discards their ordering and magnitude).
Column composition solves this problem by:
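- Selecting subsets of the input features, each defined by column names, indices, dtype selectors, or callable selectors. The subsets are usually non-overlapping, although scikit-learn's ColumnTransformer also accepts a column that is assigned to more than one transformer.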
- Routing each subset to a dedicated transformer or sub-pipeline that is appropriate for that feature type.
- Fitting each transformer independently on its assigned subset of the training data.
- Concatenating the outputs of all transformers into a single feature matrix.
This pattern enables heterogeneous data handling within a unified estimator interface. The composed object itself implements fit and transform, so it can be placed inside a larger Pipeline just like any single transformer.
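The four steps above can be sketched with scikit-learn's ColumnTransformer. This is a minimal illustration rather than a recommended configuration: the toy DataFrame, its column names, and the transformer labels "num" and "cat" are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
X = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 88_000, 61_000],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

# Each (name, transformer, columns) triple routes one subset of columns
# to a transformer suited to that feature type.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# fit_transform fits each transformer on its own columns and
# concatenates the results column-wise into one feature matrix.
Xt = preprocessor.fit_transform(X)
print(Xt.shape)  # 2 scaled columns + 3 one-hot columns
```

Because the composed object exposes fit and transform itself, the same preprocessor can be passed wherever a single transformer is expected.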
Key design considerations for column composition include:
- Remainder handling: Features not explicitly assigned to any transformer can be dropped (the default), passed through untransformed, or routed to a catch-all transformer.
- Output ordering: The order of features in the output follows the order of transformer specifications. Remainder columns, if preserved, are appended at the end.
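- Sparse output: When at least one transformer returns a sparse matrix, the stacked result is returned as a sparse matrix if its overall density falls below a threshold (sparse_threshold, 0.3 by default in scikit-learn's ColumnTransformer). Otherwise it is returned as a dense array.
- Feature name propagation: When verbose feature names are enabled (verbose_feature_names_out=True, the default), each output column name is prefixed with the transformer name and a double underscore, for example num__age, which ensures uniqueness across transformers.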
Usage
Column composition is used whenever a dataset contains multiple feature types that require different preprocessing steps. It is the central structural element in most preprocessing pipelines and is typically placed as the first step in a larger Pipeline before a final estimator.
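A typical arrangement, sketched under assumed data, places the column composer as the first Pipeline step and an estimator as the last. The feature values, the labels in y, the step names, and the choice of LogisticRegression are all illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data and binary labels.
X = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63],
    "income": [28_000, 54_000, 91_000, 67_000, 39_000, 85_000],
    "city": ["Lyon", "Paris", "Paris", "Nice", "Lyon", "Nice"],
})
y = [0, 0, 1, 1, 0, 1]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# The preprocessing step and the estimator are fitted together,
# so the same fitted transformations are reused at prediction time.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classify", LogisticRegression()),
])
model.fit(X, y)

# A new row with a city unseen during fit is encoded as all zeros
# because the encoder was created with handle_unknown="ignore".
X_new = pd.DataFrame({"age": [45], "income": [70_000], "city": ["Lille"]})
pred_new = model.predict(X_new)
```

Keeping preprocessing inside the Pipeline means cross-validation refits the transformers on each training fold, which avoids leaking statistics from held-out data.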
Theoretical Basis
Column composition is an instance of the composite pattern in software engineering, where individual objects and compositions of objects are treated uniformly through a shared interface. In the context of scikit-learn:
- Each individual transformer has a fit/transform interface.
- The column composer (ColumnTransformer) also has a fit/transform interface.
- Client code (e.g., a Pipeline) does not need to know whether it is working with a single transformer or a composed set of transformers.
From a statistical perspective, column composition implements feature-parallel preprocessing: each feature subset is transformed independently, and the transformed features are combined under the assumption that no transformer needs information from columns outside its own subset. This assumption holds for most standard preprocessing operations (scaling, encoding, univariate imputation), where each feature's transformation depends only on that feature's own values.
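The independence property can be checked numerically: fitting the transformers through the composer should give the same result as fitting each one separately on its own columns and stacking the outputs. The sketch below assumes two hypothetical numeric columns, a and b, and two different scalers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric data.
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 40.0, 80.0],
})

# Composed: each scaler sees only its assigned column.
ct = ColumnTransformer(
    transformers=[
        ("std", StandardScaler(), ["a"]),
        ("minmax", MinMaxScaler(), ["b"]),
    ]
)
composed = ct.fit_transform(X)

# Separate: fit each scaler by hand, then concatenate column-wise.
separate = np.hstack([
    StandardScaler().fit_transform(X[["a"]]),
    MinMaxScaler().fit_transform(X[["b"]]),
])

# Both routes produce the same feature matrix.
same = np.allclose(composed, separate)
```

A transformer that looks across columns, such as a multivariate imputer, would only see the columns in its own subset, so the choice of subsets matters for such cases.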