Principle:Kubeflow Pipelines Data Format Conversion
| Sources | Apache Parquet, Pandas |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-13 |
Overview
Data format conversion is the process of transforming data from one storage format to another (e.g., CSV to Apache Parquet) so that each pipeline stage works with a format suited to its access pattern.
Description
Data format conversion transforms tabular data between serialization formats. CSV is human-readable and widely supported, but it stores every value as text with no type information, and reading a single column still requires scanning the whole file, which makes it slow and bulky for large datasets. Apache Parquet is a binary columnar format that provides compression, type preservation, and efficient column-level access. Converting between formats lets a pipeline use the advantages of each format at different stages.
Usage
Use this principle when a downstream component requires a specific input format, or when converting to a more efficient format reduces storage or computation cost.
Theoretical Basis
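Data serialization tradeoffs, row-oriented (CSV) vs. columnar (Parquet):
- Columnar formats (Parquet) excel at analytical queries that read a subset of columns, and they compress well because values of the same column are stored together
- Row formats (CSV) excel at streaming records one at a time and at simplicity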
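The tradeoff above can be illustrated with plain Python data structures. This is a simplified sketch using only the standard library: the sample records and the helper names `to_columns` and `run_length_encode` are hypothetical, and run-length encoding stands in for the more involved encodings that Parquet applies.

```python
from itertools import groupby

# Hypothetical sample records, used only for illustration.
FIELDS = ("country", "year", "amount")
ROWS = [
    ("US", 2024, 10.0),
    ("US", 2024, 12.5),
    ("US", 2025, 7.0),
    ("DE", 2025, 3.0),
    ("DE", 2025, 4.5),
]


def to_columns(rows, fields):
    """Pivot row tuples into one list per field (a columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(fields)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(ROWS, FIELDS)

# Analytical query: summing one field reads a single list in the columnar
# layout, while the row layout has to visit every full record.
total_from_columns = sum(columns["amount"])
total_from_rows = sum(row[2] for row in ROWS)

# Compression: within a column, similar values sit next to each other,
# so repeated runs collapse into a few pairs.
encoded_country = run_length_encode(columns["country"])

if __name__ == "__main__":
    print(total_from_columns, total_from_rows)
    print(encoded_country)
```

Both totals are the same; the difference is how much data each layout has to touch to compute them. The row layout, in turn, lets a producer emit or append one complete record at a time, which is why it suits streaming.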