Workflow:Pola-rs Polars Data IO and Format Conversion

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, ETL, File_IO
Last Updated 2026-02-09 09:30 GMT

Overview

End-to-end process for reading data from various file formats and storage backends, transforming it with Polars, and writing it to a target format.

Description

This workflow covers Polars' comprehensive I/O capabilities across multiple file formats (CSV, Parquet, JSON, NDJSON, IPC/Arrow, Excel, Avro) and storage backends (local filesystem, AWS S3, Azure Blob Storage, Google Cloud Storage, Hugging Face Hub). It includes both eager reading (read_*) and lazy scanning (scan_*), format conversion between any supported pair, and handling of Hive-style partitioned datasets. The workflow also addresses authentication and credential management for cloud storage.

Usage

Execute this workflow when you need to ingest data from one or more sources in different formats, optionally transform it, and write it to a target format or storage location. Common scenarios include converting CSV files to Parquet for better compression and query performance, reading partitioned datasets from cloud storage, and creating data pipelines that bridge multiple data sources.

Execution Steps

Step 1: Configure Storage Access

Set up authentication credentials and storage options for the data source. For local files, no configuration is needed. For cloud storage, configure access keys, credential providers, or assume-role settings.

Key considerations:

  • Local files require no configuration
  • AWS S3 uses storage_options with access keys or CredentialProviderAWS
  • Azure uses CredentialProviderAzure or custom bearer token functions
  • GCS uses storage_options or credential_provider
  • Retry configuration can be set via storage_options for resilience

Step 2: Read or Scan Source Data

Load data into Polars using either eager read functions (for small datasets that fit in memory) or lazy scan functions (for large datasets or when query optimization is desired). Choose the appropriate function for the source format.

Key considerations:

  • Eager: read_csv, read_parquet, read_json, read_ndjson, read_ipc, read_excel, read_avro
  • Lazy: scan_csv, scan_parquet, scan_ndjson, scan_ipc
  • Lazy scanning enables predicate pushdown and projection pushdown at the I/O level
  • For multiple files, use glob patterns (e.g., "data/*.parquet") with scan functions
  • For Hive-partitioned data, set hive_partitioning=True to auto-detect partition columns

Step 3: Transform Data

Apply any necessary transformations such as type casting, column renaming, filtering, or schema alignment between source and target formats. This step ensures data compatibility with the target format requirements.

Key considerations:

  • Cast date strings to proper temporal types using str.to_date or str.to_datetime
  • Rename columns to match target schema conventions
  • Handle null values and missing data according to target format expectations
  • For format conversion, verify that all source data types have valid mappings in the target format

Step 4: Write to Target Format

Write the processed DataFrame to the desired output format and storage location. Choose the appropriate write function and configure format-specific options like compression, row group sizes, or partitioning.

Key considerations:

  • Use write_parquet for columnar analytics (supports snappy, zstd, lz4, gzip compression)
  • Use write_csv for human-readable interchange
  • Use write_ndjson for streaming/append-friendly JSON output
  • Use write_ipc for zero-copy Arrow interchange
  • Use write_excel for business user consumption
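  • For partitioned output, pass partition_by columns to write_parquet (marked unstable in current Polars releases); the partitioning arguments accepted by sink_parquet vary between Polars versions, so check the documentation for the installed release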

Step 5: Validate Output

Verify the written data by reading it back and comparing schema and row counts. For format conversions, ensure no data loss or type degradation occurred during the conversion process.

Key considerations:

  • Read back the written file and compare schemas
  • Check row counts match between source and target
  • Verify that null values are preserved correctly
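  • For Parquet, inspect the file schema (for example with pl.read_parquet_schema) and, where row-group metadata and statistics matter, read them with a Parquet metadata reader such as pyarrow.parquet.ParquetFile(...).metadata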
