Principle:Eventual Inc Daft Parquet Writing

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Data_Storage
Last Updated: 2026-02-08 00:00 GMT

Overview

A technique for persisting DataFrame contents as Apache Parquet columnar files.

Description

Writing to Parquet provides compact long-term storage for analytical data. Parquet is a columnar file format: it stores values column by column rather than row by row, so similar values sit next to each other and compress well, and readers can load only the columns a query needs (column pruning). Daft supports three write modes (append, overwrite, and partition-level overwrite) as well as Hive-style directory partitioning for organizing output files.

Usage

Use Parquet writing when you need to persist processed data in columnar storage. Parquet is a widely used format for analytical workloads, data lake architectures, and inter-system data exchange where query performance and storage efficiency matter.

Theoretical Basis

Columnar serialization with Parquet involves:

Parquet File Structure:
  File = [Row Group 1] [Row Group 2] ... [Footer]
  Row Group = [Column Chunk 1] [Column Chunk 2] ...
  Column Chunk = [Page 1] [Page 2] ...
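
The footer placement in the layout above can be illustrated with a short sketch. It relies on the framing defined by the Parquet format: the file begins and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The helper name `locate_footer` and the placeholder byte strings are hypothetical; the synthetic "file" is not valid Parquet, since a real footer is Thrift-encoded metadata.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the footer metadata in a Parquet byte string."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic")
    # The footer length sits just before the trailing magic bytes.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


# Synthetic stand-in for a file: placeholders instead of real row groups and metadata.
row_groups = b"<row group 1><row group 2>"
footer = b"<schema + row-group metadata>"
fake_file = MAGIC + row_groups + footer + struct.pack("<I", len(footer)) + MAGIC

offset, length = locate_footer(fake_file)
```

Because the footer is found from the end of the file, a reader can fetch the schema and row-group metadata first and then seek directly to the column chunks it needs.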

Write Modes:
  - append: add new files alongside existing data
  - overwrite: replace all existing data
  - overwrite-partitions: replace only affected partition directories
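
The difference between the three modes can be modeled on a local directory tree. This is a sketch of the semantics listed above, not Daft's implementation: `write_files` is a hypothetical helper, the file contents are placeholder bytes rather than real Parquet data, and stale files are deleted before the new ones are written purely for simplicity.

```python
import tempfile
from pathlib import Path


def write_files(root: Path, new_files: dict[str, bytes], mode: str = "append") -> None:
    """Write files given as {relative_path: payload} under root using the given mode."""
    if mode not in {"append", "overwrite", "overwrite-partitions"}:
        raise ValueError(f"unknown write mode: {mode}")
    root.mkdir(parents=True, exist_ok=True)

    if mode == "overwrite":
        # Everything already under root is replaced.
        stale = list(root.rglob("*.parquet"))
    elif mode == "overwrite-partitions":
        # Only directories that receive new files are cleared.
        touched = {(root / rel).parent for rel in new_files}
        stale = [p for d in touched if d.exists() for p in d.glob("*.parquet")]
    else:
        # append: existing files are left alone.
        stale = []

    for path in stale:
        path.unlink()
    for rel, payload in new_files.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)


def listing(root: Path) -> list[str]:
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.parquet"))


root = Path(tempfile.mkdtemp())
write_files(root, {"year=2023/a.parquet": b"old", "year=2024/b.parquet": b"old"})

write_files(root, {"year=2024/c.parquet": b"new"}, mode="append")
after_append = listing(root)
# ['year=2023/a.parquet', 'year=2024/b.parquet', 'year=2024/c.parquet']

write_files(root, {"year=2024/d.parquet": b"new"}, mode="overwrite-partitions")
after_partition_overwrite = listing(root)
# ['year=2023/a.parquet', 'year=2024/d.parquet']

write_files(root, {"year=2025/e.parquet": b"new"}, mode="overwrite")
after_overwrite = listing(root)
# ['year=2025/e.parquet']
```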

Hive-Style Partitioning:
  root_dir/
    partition_col=value1/
      data_file_1.parquet
    partition_col=value2/
      data_file_2.parquet
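
The directory layout above follows a simple naming rule: one `column=value` directory level per partition column. A minimal sketch of that rule is shown here; `hive_path` is a hypothetical helper, and percent-encoding special characters in names and values is an assumption made so that a value such as `a/b` cannot introduce an extra directory level. Writers may encode such values differently.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def hive_path(root: str, partition_values: dict, filename: str) -> str:
    """Build root/col1=val1/col2=val2/filename from ordered partition values."""
    segments = [
        f"{quote(str(col), safe='')}={quote(str(val), safe='')}"
        for col, val in partition_values.items()
    ]
    return str(PurePosixPath(root, *segments, filename))


example = hive_path("root_dir", {"year": 2024, "country": "US"}, "data_file_1.parquet")
# 'root_dir/year=2024/country=US/data_file_1.parquet'
```

The order of the partition columns determines the nesting order of the directories, so the same column order should be used for every write to a given root.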

Compression Codecs:
  - snappy (default): fast compression and decompression, moderate ratio
  - gzip: higher compression ratio, slower to write
  - zstd: balanced speed and ratio
  - lz4: very fast compression, typically a lower ratio

Partitioning enables partition pruning during reads: when a filter predicate references a partition column, only the directories whose values match the predicate are scanned.
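
Pruning works because the partition values can be recovered from the directory names alone, without opening any file. A minimal sketch for equality filters, using the example paths from the layout above; `partition_values` and `prune` are hypothetical helpers, not part of any library, and real query engines also handle ranges, type conversion, and encoded values.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Extract {column: value} from the col=value directory segments of a path."""
    directories = PurePosixPath(path).parts[:-1]
    return dict(seg.split("=", 1) for seg in directories if "=" in seg)


def prune(paths: list[str], **equals) -> list[str]:
    """Keep only the files whose partition values satisfy every equality filter."""
    return [
        p
        for p in paths
        if all(partition_values(p).get(col) == str(val) for col, val in equals.items())
    ]


files = [
    "root_dir/partition_col=value1/data_file_1.parquet",
    "root_dir/partition_col=value2/data_file_2.parquet",
]

kept = prune(files, partition_col="value1")
# ['root_dir/partition_col=value1/data_file_1.parquet']
```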
