Implementation:Scikit learn Scikit learn ColumnTransformer Init
Overview
Concrete tool, provided by scikit-learn, for applying transformers to column subsets of an array or DataFrame.
Code Reference
Class: ColumnTransformer
Module: sklearn/compose/_column_transformer.py (lines 67-325)
Inheritance: TransformerMixin, _BaseComposition
Constructor signature:
class ColumnTransformer(TransformerMixin, _BaseComposition):
    def __init__(
        self,
        transformers,
        *,
        remainder="drop",
        sparse_threshold=0.3,
        n_jobs=None,
        transformer_weights=None,
        verbose=False,
        verbose_feature_names_out=True,
        force_int_remainder_cols="deprecated",
    ):
This estimator allows different columns or column subsets of the input to be transformed separately. The features generated by each transformer are concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
I/O Contract
Constructor parameters:
- transformers: list of tuples -- List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.
  - name: str -- Identifier for the transformer, used in set_params and grid search.
  - transformer: estimator, 'drop', or 'passthrough' -- The transformer to apply. Must support fit and transform. The special strings 'drop' and 'passthrough' are accepted to drop columns or pass them through untransformed.
  - columns: str, array-like, int, slice, or callable -- Specifies which columns to route to this transformer. A callable (such as make_column_selector) is passed the input data X and must return column specifiers.
- remainder: 'drop', 'passthrough', or estimator, default='drop' -- Treatment of columns not assigned to any transformer. 'drop' discards them; 'passthrough' appends them untransformed to the output; an estimator is fitted on them and its output is appended.
- sparse_threshold: float, default=0.3 -- If any transformer returns a sparse matrix, the outputs are stacked as a sparse matrix when the overall density is below this threshold; otherwise the result is dense.
- n_jobs: int, default=None -- Number of jobs to run in parallel for fitting transformers. None means 1 unless in a joblib.parallel_backend context.
- transformer_weights: dict, default=None -- Multiplicative weights for features per transformer. Keys are transformer names, values are the weights.
- verbose: bool, default=False -- If True, print the time elapsed while fitting each transformer.
- verbose_feature_names_out: bool, str, or Callable, default=True -- Controls feature name prefixing in get_feature_names_out. If True, feature names are prefixed with the transformer name. If False, no prefix is added (an error is raised if the names are not unique). A string or callable allows custom formatting.
- force_int_remainder_cols: bool, default="deprecated" -- Deprecated parameter, will be removed in version 1.9.
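To make the effect of remainder and transformer_weights concrete, here is a minimal sketch. The array X and the transformer name "scale" are illustrative choices, not part of the reference above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: column 0 is scaled, columns 1 and 2 are left unassigned.
X = np.array([[0.0, 10.0, 5.0],
              [1.0, 20.0, 6.0],
              [2.0, 30.0, 7.0]])

# Default remainder="drop": unassigned columns are discarded.
dropped = ColumnTransformer([("scale", MinMaxScaler(), [0])])
out_dropped = dropped.fit_transform(X)

# remainder="passthrough": unassigned columns are appended unchanged.
kept = ColumnTransformer([("scale", MinMaxScaler(), [0])], remainder="passthrough")
out_kept = kept.fit_transform(X)

# transformer_weights multiplies the output of the named transformer.
weighted = ColumnTransformer(
    [("scale", MinMaxScaler(), [0])],
    transformer_weights={"scale": 2.0},
)
out_weighted = weighted.fit_transform(X)

print(out_dropped.shape)     # (3, 1)
print(out_kept.shape)        # (3, 3)
print(out_weighted.ravel())  # scaled values 0, 0.5, 1 multiplied by 2
```

Only the scaled column survives with the default remainder, while the passthrough variant keeps all three columns, with the untouched ones placed after the transformed one.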
Fitted attributes:
- transformers_: list -- The collection of fitted transformers as (name, fitted_transformer, column) tuples.
- named_transformers_: Bunch -- Read-only attribute to access any transformer by given name.
- sparse_output_: bool -- Whether the output of transform is a sparse matrix.
- output_indices_: dict -- Maps each transformer name to a slice indicating its position in the transformed output.
- n_features_in_: int -- Number of features seen during fit.
- feature_names_in_: ndarray of shape (n_features_in_,) -- Names of features seen during fit.
Implementation Details
The ColumnTransformer operates as follows:
- During fit, it validates the transformer specifications, resolves callable column selectors by passing the input data to them, slices the input to extract each column subset, and fits each transformer on its subset (optionally in parallel via joblib).
- During transform, it applies each fitted transformer to its column subset and concatenates the results horizontally. If a remainder policy is set, unspecified columns are handled accordingly.
- The order of features in the output follows the order of the transformers list, with remainder columns appended at the end if remainder='passthrough'.
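The steps above can be approximated by hand for dense inputs. The following is a simplified sketch, not the library's actual implementation: the helper manual_fit_transform is hypothetical, and it ignores validation, callable selectors, sparse outputs, weights, and parallelism.

```python
import numpy as np
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[0.0, 1.0, 10.0],
              [2.0, 3.0, 20.0],
              [4.0, 5.0, 30.0]])

specs = [("std", StandardScaler(), [0, 1]),
         ("minmax", MinMaxScaler(), [2])]


def manual_fit_transform(specs, X):
    """Hypothetical helper: slice, fit, transform, then stack horizontally."""
    blocks = []
    for _name, transformer, columns in specs:
        subset = X[:, columns]                    # slice out the column subset
        fitted = clone(transformer).fit(subset)   # fit a fresh copy on it
        blocks.append(fitted.transform(subset))   # transform the same subset
    return np.hstack(blocks)                      # concatenate in list order


manual = manual_fit_transform(specs, X)
reference = ColumnTransformer(specs).fit_transform(X)
print(np.allclose(manual, reference))  # True

# Output order follows the transformers list; passthrough remainder goes last.
reorder = ColumnTransformer(
    [("third", "passthrough", [2]), ("first", "passthrough", [0])],
    remainder="passthrough",
)
ordered = reorder.fit_transform(X)
print(ordered[0])  # input columns 2, 0, then the remaining column 1
```

The second part shows the ordering rule: column 2 comes first because its transformer is listed first, and the unassigned column 1 is appended at the end.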
Usage Examples
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
ct = ColumnTransformer(
    [("norm1", Normalizer(norm='l1'), [0, 1]),
     ("norm2", Normalizer(norm='l1'), slice(2, 4))]
)
X = np.array([[0., 1., 2., 2.],
              [1., 1., 0., 1.]])
ct.fit_transform(X)
# array([[0. , 1. , 0.5, 0.5],
#        [0.5, 0.5, 0. , 1. ]])
With callable column selectors:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
X = pd.DataFrame({
    "age": [25, 30, None, 45],
    "city": ["London", "Paris", "London", "Berlin"]
})
preprocessor = ColumnTransformer(
    transformers=[
        # Numeric columns: fill missing values with the mean, then standardize.
        ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
         make_column_selector(dtype_include=np.number)),
        # Object columns: fill missing values with the mode, then one-hot encode.
        ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder()),
         make_column_selector(dtype_include=object)),
    ]
)
preprocessor.fit_transform(X)