Implementation:Scikit learn Scikit learn ColumnTransformer Init

From Leeroopedia


Overview

Concrete scikit-learn tool for applying different transformers to different column subsets of an array or DataFrame.

Code Reference

Class: ColumnTransformer

Module: sklearn/compose/_column_transformer.py (lines 67-325)

Inheritance: TransformerMixin, _BaseComposition

Constructor signature:

class ColumnTransformer(TransformerMixin, _BaseComposition):
    def __init__(
        self,
        transformers,
        *,
        remainder="drop",
        sparse_threshold=0.3,
        n_jobs=None,
        transformer_weights=None,
        verbose=False,
        verbose_feature_names_out=True,
        force_int_remainder_cols="deprecated",
    ):

This estimator allows different columns or column subsets of the input to be transformed separately. The features generated by each transformer are concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

I/O Contract

Constructor parameters:

  • transformers : list of tuples -- List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.
    • name : str -- Identifier for the transformer, used in set_params and grid search.
    • transformer : estimator, 'drop', or 'passthrough' -- The transformer to apply. Must support fit and transform. Special strings 'drop' and 'passthrough' are accepted to drop columns or pass them through untransformed.
    • columns : str, array-like, int, slice, or callable -- Specifies which columns to route to this transformer. A callable (such as make_column_selector) is passed the input data X and must return column specifiers.
  • remainder : 'drop', 'passthrough', or estimator, default='drop' -- Treatment of columns not assigned to any transformer. 'passthrough' appends them untransformed to the output; an estimator applies that estimator to them.
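  • sparse_threshold : float, default=0.3 -- Applies only when at least one transformer returns a sparse matrix: the stacked result is returned as a sparse matrix if its overall density is below this threshold, and as a dense array otherwise. Use sparse_threshold=0 to always return dense output. If every transformer returns dense data, the result is dense and this parameter is ignored.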
  • n_jobs : int, default=None -- Number of jobs to run in parallel for fitting transformers. None means 1 unless in a joblib.parallel_backend context.
  • transformer_weights : dict, default=None -- Multiplicative weights for features per transformer. Keys are transformer names, values are the weights.
  • verbose : bool, default=False -- If True, print the time elapsed while fitting each transformer.
  • verbose_feature_names_out : bool, str, or Callable, default=True -- Controls feature name prefixing in get_feature_names_out. If True, feature names are prefixed with the transformer name. If False, no prefixing (errors if names are not unique). A string or callable allows custom formatting.
  • force_int_remainder_cols : bool, default="deprecated" -- Deprecated parameter, will be removed in version 1.9.

Fitted attributes:

  • transformers_ : list -- The collection of fitted transformers as (name, fitted_transformer, column) tuples.
  • named_transformers_ : Bunch -- Read-only attribute for accessing any fitted transformer by name. Keys are the transformer names and values are the fitted transformer objects.
  • sparse_output_ : bool -- Whether the output of transform is a sparse matrix.
  • output_indices_ : dict -- Maps each transformer name to a slice indicating its position in the transformed output.
  • n_features_in_ : int -- Number of features seen during fit.
  • feature_names_in_ : ndarray of shape (n_features_in_,) -- Names of features seen during fit. Defined only when X has feature names that are all strings.

Implementation Details

The ColumnTransformer operates as follows:

  1. During fit, it validates the transformer specifications, resolves callable column selectors by passing the input data to them, slices the input to extract each column subset, and fits each transformer on its subset (optionally in parallel via joblib).
  2. During transform, it applies each fitted transformer to its column subset and concatenates the results horizontally. Columns not assigned to any transformer are handled according to remainder: dropped, passed through unchanged, or transformed by the remainder estimator.
  3. The order of features in the output follows the order of the transformers list. When remainder is 'passthrough' or an estimator, the remainder output is appended at the end.
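
The steps above can be approximated by hand for dense NumPy input. This is a simplified sketch of the idea, not the library's actual implementation: it ignores parallelism, sparse handling, weights and callable selectors, and the specs list is a hypothetical example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[0., 1., 2., 2., 7.],
              [1., 1., 0., 1., 8.]])

# (name, transformer, columns) tuples; column 4 is left unassigned.
specs = [("norm", Normalizer(norm="l1"), [0, 1]),
         ("scale", StandardScaler(), [2, 3])]

# Hand-rolled version: slice each column subset and fit a clone on it.
blocks = [clone(t).fit_transform(X[:, cols]) for _, t, cols in specs]

# Remainder: the columns that no transformer claimed, passed through as-is.
used = {c for _, _, cols in specs for c in cols}
rest = [c for c in range(X.shape[1]) if c not in used]

# Concatenate horizontally, remainder last.
manual = np.hstack(blocks + [X[:, rest]])

# The same specification handed to ColumnTransformer.
auto = ColumnTransformer(specs, remainder="passthrough").fit_transform(X)

same = np.allclose(manual, auto)
```

On this input the two results agree, which illustrates the slicing, per-subset fitting and horizontal concatenation described in the list.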

Usage Examples

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer

ct = ColumnTransformer(
    [("norm1", Normalizer(norm='l1'), [0, 1]),
     ("norm2", Normalizer(norm='l1'), slice(2, 4))]
)
X = np.array([[0., 1., 2., 2.],
              [1., 1., 0., 1.]])
ct.fit_transform(X)
# array([[0. , 1. , 0.5, 0.5],
#        [0.5, 0.5, 0. , 1. ]])

With callable column selectors:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X = pd.DataFrame({
    "age": [25, 30, None, 45],
    "city": ["London", "Paris", "London", "Berlin"]
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
         make_column_selector(dtype_include=np.number)),
        ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder()),
         make_column_selector(dtype_include=object)),
    ]
)
preprocessor.fit_transform(X)
