Implementation:Scikit learn Scikit learn Pipeline Fit Predict

Overview

Concrete tool, provided by scikit-learn, for fitting a pipeline of transformers with a final estimator and then predicting with it.

Code Reference

Class: Pipeline

Module: sklearn/pipeline.py

Methods documented:

Pipeline.fit (lines 567-624)
Pipeline.predict (lines 698-753)
Pipeline.fit_transform (lines 638-696)

Pipeline.fit

Signature:

def fit(self, X, y=None, **params):

Fit all the transformers one after the other and sequentially transform the data. Finally, fit the transformed data using the final estimator.

Pipeline.predict

Signature:

def predict(self, X, **params):

Transform the data through all intermediate steps, and apply predict with the final estimator. Only valid if the final estimator implements predict.
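The availability rule can be checked with hasattr. The following is a minimal sketch, assuming a scikit-learn version in which Pipeline exposes predict only when the final step provides it; the step names are arbitrary choices for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Final step is a classifier, so the pipeline exposes predict.
clf_pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
has_predict = hasattr(clf_pipe, "predict")

# Final step is a transformer without predict, so the attribute lookup fails.
tf_pipe = Pipeline([("scaler", StandardScaler())])
lacks_predict = not hasattr(tf_pipe, "predict")

print(has_predict, lacks_predict)
```

Checking with hasattr before calling predict avoids an AttributeError on pipelines that end in a transformer.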

Pipeline.fit_transform

Signature:

def fit_transform(self, X, y=None, **params):

Fit the model and transform with the final estimator. Only valid if the final estimator implements fit_transform or both fit and transform.

I/O Contract

fit parameters:

X : iterable -- Training data. Must fulfill input requirements of the first step of the pipeline.
y : iterable, default=None -- Training targets. Must fulfill label requirements for all steps of the pipeline.
**params : dict of str -> object -- With metadata routing disabled (default): parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p. With metadata routing enabled: parameters are routed to steps that have requested them.

fit returns:

self : Pipeline -- The fitted pipeline object.
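The s__p naming convention for **params might look like this in practice. This is a sketch with metadata routing left disabled (the default); the step name "clf" and the uniform weights are illustrative assumptions, not part of the original page.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
weights = np.ones(len(y))  # hypothetical uniform sample weights

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# "clf__sample_weight" is forwarded as sample_weight to the fit method
# of the step named "clf"; the scaler step receives no extra parameters.
fitted = pipe.fit(X, y, clf__sample_weight=weights)

# fit returns the pipeline itself, so calls can be chained.
print(fitted is pipe)
```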

predict parameters:

X : iterable -- Data to predict on. Must fulfill input requirements of the first step of the pipeline.
**params : dict of str -> object -- With metadata routing disabled: parameters to the predict method of the final estimator. With metadata routing enabled: parameters are routed to appropriate steps.

predict returns:

y_pred : ndarray -- Result of calling predict on the final estimator.

fit_transform returns:

Xt : ndarray of shape (n_samples, n_transformed_features) -- Transformed samples.

Implementation Details

fit method

The fit method executes the fit-transform cascade:

Validates routing parameters via _check_method_params.
Calls the internal _fit method, which iterates through all intermediate steps (all steps except the last). For each intermediate step, it calls fit_transform (if available) or fit followed by transform, passing the output as input to the next step.
Fits the final estimator on the fully transformed data by calling self._final_estimator.fit(Xt, y, **last_step_params["fit"]).
If the final estimator is the string "passthrough", the final fit step is skipped.
Returns self.
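The cascade above can be sketched by hand for a two-step pipeline. This is only an illustration of the order of calls, not the library's internal code; it ignores caching, "passthrough" steps, and parameter routing, and the step names are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Hand-written equivalent of the cascade, on unfitted clones of the same steps.
scaler = clone(pipe.named_steps["scaler"])
clf = clone(pipe.named_steps["clf"])
Xt = scaler.fit_transform(X)  # intermediate step: fit, then transform
clf.fit(Xt, y)                # final estimator: fit on the transformed data

# Both routes should produce identical predictions on the training data.
same = np.array_equal(clf.predict(Xt), pipe.predict(X))
print(same)
```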

predict method

The predict method executes the transform-predict cascade:

Calls check_is_fitted(self) to verify that the pipeline has been fitted.
Iterates through all intermediate steps (excluding the final estimator), calling each step's transform(Xt) so that the output of one step becomes the input of the next.
Calls self.steps[-1][1].predict(Xt, **params) on the final estimator.
When metadata routing is enabled, parameters are routed to the appropriate steps' transform methods and to the final estimator's predict method.
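The transform-predict cascade can be reproduced manually on a fitted pipeline. This sketch assumes metadata routing is disabled and that no intermediate step is set to "passthrough"; the three steps chosen here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression()),
]).fit(X, y)

# Apply every intermediate step's transform in order.
Xt = X
for name, step in pipe.steps[:-1]:
    Xt = step.transform(Xt)

# Call predict on the final estimator with the transformed data.
manual_pred = pipe.steps[-1][1].predict(Xt)

matches = np.array_equal(manual_pred, pipe.predict(X))
print(matches)
```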

fit_transform method

The fit_transform method combines fitting and transforming:

Runs the same _fit cascade as fit for all intermediate steps.
For the final step, calls fit_transform if available, otherwise calls fit followed by transform.
If the final estimator is "passthrough", returns the already-transformed data directly.
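Both cases can be illustrated with a transformer-only pipeline and a pipeline whose last step is "passthrough". This is a sketch on synthetic data; the step names and the choice of PCA are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=50, n_features=10, random_state=0)

# Final step is a transformer: fit_transform returns its output.
tf_pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=2))])
Xt = tf_pipe.fit_transform(X)
print(Xt.shape)  # (50, 2)

# Final step is "passthrough": the output of the last real step is returned.
pt_pipe = Pipeline([("scaler", StandardScaler()), ("final", "passthrough")])
Xt_pt = pt_pipe.fit_transform(X)
same_as_scaler = np.allclose(Xt_pt, StandardScaler().fit_transform(X))
print(same_as_scaler)
```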

Usage Examples

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# fit: runs StandardScaler.fit_transform on X_train, then SVC.fit on scaled data
pipe.fit(X_train, y_train)

# predict: runs StandardScaler.transform on X_test, then SVC.predict on scaled data
y_pred = pipe.predict(X_test)

# score: runs StandardScaler.transform on X_test, then SVC.score, which compares predictions with y_test
accuracy = pipe.score(X_test, y_test)  # 0.88

Passing step-specific parameters:

# Set parameters of the SVC step using the '__' separator
pipe.set_params(svc__C=10)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)  # 0.76

Full preprocessing pipeline with fit and predict:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Build preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         make_column_selector(dtype_include=np.number)),
        ("cat", make_pipeline(SimpleImputer(strategy="constant"), OneHotEncoder()),
         make_column_selector(dtype_include=object)),
    ]
)

# Build full pipeline
clf = make_pipeline(preprocessor, LogisticRegression())

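# Execute: fit learns all parameters, predict applies them.
# X_train and X_test must be pandas DataFrames here, because make_column_selector
# selects columns by dtype; the NumPy arrays from the earlier example will not work.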
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
