Implementation: Kubeflow Pipelines Chicago Taxi Dataset Pipeline
Sources: Kubeflow Pipelines
Domains: Data_Engineering, ETL
Last Updated: 2026-02-13
Overview
Wrapper doc for a chain of reusable Kubeflow Pipelines (KFP) components that load and prepare Chicago Taxi training data.
Description
Three components are chained together:

- `chicago_taxi_dataset_op`: loads the dataset with SQL-style filtering
- `pandas_transform_csv_op`: extracts the label column via a pandas expression
- `drop_header_op`: removes the CSV header for metric evaluation

All three are loaded via `components.load_component_from_url()`.
Code Reference
Source: `samples/core/train_until_good/train_until_good.py` (L22-27 loading, L71-83 invocation)
Import: `from kfp import components`
Signature
```python
# chicago_taxi_dataset_op
chicago_taxi_dataset_op(where: str, select: str, limit: int) -> output

# pandas_transform_csv_op
pandas_transform_csv_op(table: CSV, transform_code: str) -> output

# drop_header_op
drop_header_op(table: CSV) -> output
```
I/O Contract

Inputs

| Name | Type | Description |
|---|---|---|
| where | str | SQL `WHERE` filter clause |
| select | str | Comma-separated columns to select |
| limit | int | Maximum number of rows to load |
| transform_code | str | pandas expression applied to the loaded table |

Outputs

| Name | Type | Description |
|---|---|---|
| training_data | CSV | Prepared training data |
| true_values | headerless CSV | Ground-truth labels with the header row removed |
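To make the contract above concrete, here is a stdlib-only illustration of what the chain does to the data. This emulates the label-extraction and header-dropping steps locally; it is not the KFP components themselves, and the sample CSV values are invented for the sketch.

```python
import csv
import io

# Illustration only: emulates the data flow of the component chain
# using the stdlib instead of the actual KFP components.
raw_csv = "tips,trip_seconds,trip_miles\n1.5,300,1.2\n0.0,600,2.4\n"

# pandas_transform_csv_op with transform_code 'df = df[["tips"]]'
# keeps only the label column:
rows = list(csv.reader(io.StringIO(raw_csv)))
header = rows[0]
tips_idx = header.index("tips")
labels = [[row[tips_idx]] for row in rows]  # still includes the header row

# drop_header_op then strips the header so only values remain:
true_values = labels[1:]
print(true_values)  # [['1.5'], ['0.0']]
```

The headerless output is what makes `true_values` directly comparable row-by-row against model predictions during metric evaluation.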
Usage Examples
```python
training_data = chicago_taxi_dataset_op(
    where='trip_start_timestamp >= "2019-01-01" AND trip_start_timestamp < "2019-02-01"',
    select='tips,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tolls,extras,trip_total',
    limit=10000,
).output

true_values_table = pandas_transform_csv_op(
    table=training_data,
    transform_code='df = df[["tips"]]',
).output

true_values = drop_header_op(true_values_table).output
```
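The `transform_code` string in the example above is Python that the component executes against a DataFrame bound to the name `df`. A local sketch of that mechanism (assuming pandas is installed; the exec-with-`df` binding mirrors the component's documented contract, and the sample data is invented):

```python
import pandas as pd

# Local sketch of pandas_transform_csv_op applying
# transform_code 'df = df[["tips"]]'. Assumption: the component
# binds the loaded CSV to a DataFrame named `df` and executes the code.
df = pd.DataFrame(
    {"tips": [1.5, 0.0], "trip_seconds": [300, 600], "trip_miles": [1.2, 2.4]}
)

scope = {"df": df}
exec('df = df[["tips"]]', scope)  # the transform_code string from the example
result = scope["df"]
print(list(result.columns))  # ['tips']
```

Because `transform_code` is arbitrary pandas, the same component can be reused for other single-expression transformations without writing a new component.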