
Principle:Kubeflow Pipelines Training Data Preparation

From Leeroopedia

Sources: KFP Components

Domains: Data_Engineering, ETL

Last Updated: 2026-02-13

Overview

Training data preparation is the process of loading, transforming, and preparing data from external sources into a format suitable for ML model training.

Description

Data preparation involves:

  1. Loading raw data from external sources (e.g., the Chicago Taxi dataset)
  2. Applying transformations (extracting label columns via pandas)
  3. Formatting outputs for downstream consumption

In KFP pipelines, this is typically a chain of reusable components — a data loader, a transformer, and a header dropper — producing both training data and ground truth labels.
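The three-stage chain above can be sketched as plain Python functions. This is an illustrative sketch, not actual KFP component code: the function names, the sample CSV, and the column names are assumptions. In a real pipeline each function would be wrapped as a lightweight component (e.g., with `kfp.dsl.component`) and would exchange data through output artifacts rather than return values.

```python
import csv
import io

# Stand-in for an external source such as the Chicago Taxi dataset;
# the columns here are illustrative, not the real schema.
RAW_CSV = "trip_miles,fare,tips\n1.2,7.5,0\n3.4,12.0,2\n"


def load_data() -> str:
    """Extract: fetch raw CSV from an external source (simulated here)."""
    return RAW_CSV


def extract_label(raw: str, label_col: str) -> tuple:
    """Transform: split the label column out of the feature columns."""
    rows = list(csv.reader(io.StringIO(raw)))
    header, body = rows[0], rows[1:]
    idx = header.index(label_col)
    feat_buf, label_buf = io.StringIO(), io.StringIO()
    fw = csv.writer(feat_buf, lineterminator="\n")
    lw = csv.writer(label_buf, lineterminator="\n")
    fw.writerow([c for i, c in enumerate(header) if i != idx])
    lw.writerow([header[idx]])
    for row in body:
        fw.writerow([v for i, v in enumerate(row) if i != idx])
        lw.writerow([row[idx]])
    return feat_buf.getvalue(), label_buf.getvalue()


def drop_header(data: str) -> str:
    """Load: strip the header row for trainers that expect raw values only."""
    return data.split("\n", 1)[1]


features, labels = extract_label(load_data(), "tips")
train_data = drop_header(features)
```

Chaining the functions mirrors how the components would be wired in a pipeline: the loader's output feeds the transformer, which emits both training data and ground truth labels, and the header dropper produces the final training input.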

Usage

Use as the first stage in any ML training pipeline when raw data needs to be fetched and preprocessed.

Theoretical Basis

ETL (Extract, Transform, Load) pipeline pattern. Raw data is extracted from a source, transformed to isolate features and labels, then loaded into a format consumable by training components.

Stage     | Description
Extract   | Fetch raw data from an external source (e.g., BigQuery, CSV endpoint)
Transform | Apply pandas expressions to isolate label columns and clean data
Load      | Output formatted CSV files for downstream training components
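The Transform stage above can be illustrated with a minimal pandas expression that isolates a label column. This is a sketch under assumed names: the DataFrame contents and column names are invented for the example and are not taken from any specific pipeline's schema.

```python
import pandas as pd

# Illustrative raw data after the Extract stage; columns are assumptions.
df = pd.DataFrame({
    "trip_miles": [1.2, 3.4],
    "fare": [7.5, 12.0],
    "tips": [0.0, 2.0],
})

# Transform: pop the label column, leaving the feature columns behind.
labels = df.pop("tips")
features = df

# Load: write both outputs as CSV for downstream training components.
features_csv = features.to_csv(index=False)
labels_csv = labels.to_csv(index=False)
```

`DataFrame.pop` removes the column in place and returns it as a Series, which keeps features and ground truth labels as two separate outputs, matching the dual-output shape described earlier.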

Related Pages
