Implementation: Kubeflow Pipelines Chicago Taxi Dataset Pipeline
Sources: Kubeflow Pipelines
Domains: Data_Engineering, ETL
Last Updated: 2026-02-13
Overview
Wrapper doc for a chain of reusable Kubeflow Pipelines (KFP) components that load and prepare Chicago Taxi training data.
Description
Three components are chained together:

- `chicago_taxi_dataset_op`: loads the dataset with SQL-style filtering
- `pandas_transform_csv_op`: extracts the label column via a pandas expression
- `drop_header_op`: removes the CSV header for metric evaluation

All three are loaded via `components.load_component_from_url()`.
Code Reference
Source: `samples/core/train_until_good/train_until_good.py` (L22-27 loading, L71-83 invocation)
Import: `from kfp import components`
Signature
```python
# chicago_taxi_dataset_op
chicago_taxi_dataset_op(where: str, select: str, limit: int) -> output

# pandas_transform_csv_op
pandas_transform_csv_op(table: CSV, transform_code: str) -> output

# drop_header_op
drop_header_op(table: CSV) -> output
```
I/O Contract

Inputs

| Name | Type | Description |
|---|---|---|
| where | str | SQL `WHERE` filter clause |
| select | str | Comma-separated columns to select |
| limit | int | Maximum number of rows to load |
| transform_code | str | pandas expression applied to the loaded table |

Outputs

| Name | Type | Description |
|---|---|---|
| training_data | CSV | Prepared training data |
| true_values | headerless CSV | Ground-truth labels with the header row removed |
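To make the contract above concrete, here is a stdlib-only illustration of what the chain does to the data. This emulates the label-extraction and header-dropping steps locally; it is not the KFP components themselves, and the sample CSV values are invented for the sketch.

```python
import csv
import io

# Illustration only: emulates the data flow of the component chain
# using the stdlib instead of the actual KFP components.
raw_csv = "tips,trip_seconds,trip_miles\n1.5,300,1.2\n0.0,600,2.4\n"

# pandas_transform_csv_op with transform_code 'df = df[["tips"]]'
# keeps only the label column:
rows = list(csv.reader(io.StringIO(raw_csv)))
header = rows[0]
tips_idx = header.index("tips")
labels = [[row[tips_idx]] for row in rows]  # still includes the header row

# drop_header_op then strips the header so only values remain:
true_values = labels[1:]
print(true_values)  # [['1.5'], ['0.0']]
```

The headerless output is what makes `true_values` directly comparable row-by-row against model predictions during metric evaluation.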
Usage Examples
```python
training_data = chicago_taxi_dataset_op(
    where='trip_start_timestamp >= "2019-01-01" AND trip_start_timestamp < "2019-02-01"',
    select='tips,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tolls,extras,trip_total',
    limit=10000,
).output

true_values_table = pandas_transform_csv_op(
    table=training_data,
    transform_code='df = df[["tips"]]',
).output

true_values = drop_header_op(true_values_table).output
```
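The `transform_code` string in the example above is Python that the component executes against a DataFrame bound to the name `df`. A local sketch of that mechanism (assuming pandas is installed; the exec-with-`df` binding mirrors the component's documented contract, and the sample data is invented):

```python
import pandas as pd

# Local sketch of pandas_transform_csv_op applying
# transform_code 'df = df[["tips"]]'. Assumption: the component
# binds the loaded CSV to a DataFrame named `df` and executes the code.
df = pd.DataFrame(
    {"tips": [1.5, 0.0], "trip_seconds": [300, 600], "trip_miles": [1.2, 2.4]}
)

scope = {"df": df}
exec('df = df[["tips"]]', scope)  # the transform_code string from the example
result = scope["df"]
print(list(result.columns))  # ['tips']
```

Because `transform_code` is arbitrary pandas, the same component can be reused for other single-expression transformations without writing a new component.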