Principle: Kubeflow Pipelines Training Data Preparation
Sources: KFP Components
Domains: Data_Engineering, ETL
Last Updated: 2026-02-13
Overview
The process of loading raw data from external sources and transforming it into a format suitable for ML model training.
Description
Data preparation involves:
- Loading raw data from external sources (e.g., the Chicago Taxi dataset)
- Applying transformations (extracting label columns via pandas)
- Formatting outputs for downstream consumption
In KFP pipelines, this is typically a chain of reusable components — a data loader, a transformer, and a header dropper — producing both training data and ground truth labels.
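The three-component chain can be sketched as plain pandas functions, one per component; the KFP decorator wrappers and artifact passing are omitted here, and the column names (`trip_miles`, `trip_seconds`, `tips`) are stand-ins rather than the real Chicago Taxi schema:

```python
import io
import pandas as pd

# Hypothetical raw input; in a real pipeline this comes from an external source.
RAW_CSV = "trip_miles,trip_seconds,tips\n1.2,300,2.0\n3.4,900,0.0\n"

def load_data(source: str) -> pd.DataFrame:
    """Loader component: fetch raw CSV data."""
    return pd.read_csv(io.StringIO(source))

def split_label(df: pd.DataFrame, label_col: str):
    """Transformer component: isolate the label column via pandas."""
    labels = df[[label_col]]
    features = df.drop(columns=[label_col])
    return features, labels

def to_headerless_csv(df: pd.DataFrame) -> str:
    """Header-dropper component: emit CSV without a header row."""
    return df.to_csv(index=False, header=False)

raw = load_data(RAW_CSV)
features, labels = split_label(raw, "tips")
train_csv = to_headerless_csv(features)   # training data
truth_csv = to_headerless_csv(labels)     # ground truth labels
```

In a real KFP pipeline each function body would live inside its own component, with the intermediate DataFrames exchanged as file artifacts rather than in-process objects.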
Usage
Use as the first stage in any ML training pipeline when raw data needs to be fetched and preprocessed.
Theoretical Basis
ETL (Extract, Transform, Load) pipeline pattern. Raw data is extracted from a source, transformed to isolate features and labels, then loaded into a format consumable by training components.
| Stage | Description |
|---|---|
| Extract | Fetch raw data from an external source (e.g., BigQuery, CSV endpoint) |
| Transform | Apply pandas expressions to isolate label columns and clean data |
| Load | Output formatted CSV files for downstream training components |
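The Transform stage above often reduces to a single pandas expression; a minimal sketch, assuming a hypothetical `tips` label column:

```python
import pandas as pd

# Hypothetical tiny frame; "tips" stands in for the label column.
df = pd.DataFrame({"trip_miles": [1.2, 3.4], "tips": [2.0, 0.0]})

labels = df.pop("tips")   # removes the label column in place, keeps it as ground truth
features = df             # the remaining columns are the training features
```

`DataFrame.pop` both returns the label Series and drops it from the frame, so features and labels come out of one call with no risk of label leakage into the training columns.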