Principle: Kubeflow Pipelines Training Data Preparation
Sources: KFP Components
Domains: Data_Engineering, ETL
Last Updated: 2026-02-13
Overview
The process of loading raw data from external sources and transforming it into a format suitable for ML model training.
Description
Data preparation involves:
- Loading raw data from external sources (e.g., the Chicago Taxi dataset)
- Applying transformations (extracting label columns via pandas)
- Formatting outputs for downstream consumption
In KFP pipelines, this is typically a chain of reusable components — a data loader, a transformer, and a header dropper — producing both training data and ground truth labels.
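The three-component chain can be sketched as plain pandas functions, one per component; the KFP decorator wrappers and artifact passing are omitted here, and the column names (`trip_miles`, `trip_seconds`, `tips`) are stand-ins rather than the real Chicago Taxi schema:

```python
import io
import pandas as pd

# Hypothetical raw input; in a real pipeline this comes from an external source.
RAW_CSV = "trip_miles,trip_seconds,tips\n1.2,300,2.0\n3.4,900,0.0\n"

def load_data(source: str) -> pd.DataFrame:
    """Loader component: fetch raw CSV data."""
    return pd.read_csv(io.StringIO(source))

def split_label(df: pd.DataFrame, label_col: str):
    """Transformer component: isolate the label column via pandas."""
    labels = df[[label_col]]
    features = df.drop(columns=[label_col])
    return features, labels

def to_headerless_csv(df: pd.DataFrame) -> str:
    """Header-dropper component: emit CSV without a header row."""
    return df.to_csv(index=False, header=False)

raw = load_data(RAW_CSV)
features, labels = split_label(raw, "tips")
train_csv = to_headerless_csv(features)   # training data
truth_csv = to_headerless_csv(labels)     # ground truth labels
```

In a real KFP pipeline each function body would live inside its own component, with the intermediate DataFrames exchanged as file artifacts rather than in-process objects.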
Usage
Use as the first stage in any ML training pipeline when raw data needs to be fetched and preprocessed.
Theoretical Basis
ETL (Extract, Transform, Load) pipeline pattern. Raw data is extracted from a source, transformed to isolate features and labels, then loaded into a format consumable by training components.
| Stage | Description |
|---|---|
| Extract | Fetch raw data from an external source (e.g., BigQuery, CSV endpoint) |
| Transform | Apply pandas expressions to isolate label columns and clean data |
| Load | Output formatted CSV files for downstream training components |
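The Transform stage above often reduces to a single pandas expression; a minimal sketch, assuming a hypothetical `tips` label column:

```python
import pandas as pd

# Hypothetical tiny frame; "tips" stands in for the label column.
df = pd.DataFrame({"trip_miles": [1.2, 3.4], "tips": [2.0, 0.0]})

labels = df.pop("tips")   # removes the label column in place, keeps it as ground truth
features = df             # the remaining columns are the training features
```

`DataFrame.pop` both returns the label Series and drops it from the frame, so features and labels come out of one call with no risk of label leakage into the training columns.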