Principle:Kubeflow Pipelines Data Format Conversion

Sources	Apache Parquet, Pandas
Domains	Data_Engineering, ETL
Last Updated	2026-02-13

Overview

Data format conversion is the process of transforming data between storage formats (e.g., CSV to Apache Parquet) so that each pipeline stage can work with the format best suited to its processing characteristics.

Description

Data format conversion transforms tabular data between serialization formats. CSV is human-readable and widely supported, but it stores every value as text with no schema, so column types must be inferred on read. It is also inefficient for large datasets, because a reader must parse every row in full even when only a few columns are needed. Apache Parquet is a columnar format that provides compression, type preservation, and efficient column-level access. Converting between formats lets a pipeline use the advantages of each format at the stage where they matter.

Usage

Use this principle when a downstream component requires a specific format, or when data should be converted to a more efficient format for storage or computation.

Theoretical Basis

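Data serialization tradeoffs between row-oriented (CSV) and columnar (Parquet) layouts:

Columnar formats (Parquet) excel at analytical queries that read a subset of columns, and at compression, because values of the same type are stored together.
Row formats (CSV) excel at streaming, since records can be written and read one at a time, and at simplicity.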
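The tradeoff above can be illustrated with a toy example that uses only the Python standard library. It is a sketch of the two layouts, not of Parquet's actual on-disk structure (which adds row groups, column chunks, and encodings); the sample data, the SCHEMA mapping, and the to_columns helper are all assumptions made for illustration.

```python
import csv
import io

CSV_TEXT = "id,score,name\n1,0.5,a\n2,1.5,b\n3,2.5,c\n"

# Row-oriented: each record is parsed and handled as a unit,
# which suits streaming one record at a time.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# CSV carries no type information, so every value comes back as a string.
first_score_as_text = rows[0]["score"]

# Hypothetical schema: the types a columnar file would store alongside the data.
SCHEMA = {"id": int, "score": float, "name": str}


def to_columns(records, schema):
    """Pivot row records into one typed list per column (toy columnar layout)."""
    return {
        name: [cast(record[name]) for record in records]
        for name, cast in schema.items()
    }


columns = to_columns(rows, SCHEMA)

# Column-oriented: an aggregate touches only the column it needs.
mean_score = sum(columns["score"]) / len(columns["score"])
```

In the row layout, computing the mean score means visiting every field of every record; in the columnar layout, only the score list is read, which is the access pattern analytical queries benefit from.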

Related Pages

Implementation:Kubeflow_Pipelines_Convert_CSV_To_Parquet_Op
