Heuristic:Apache Paimon Compression Tuning

Knowledge Sources	Apache Paimon Core Options Descriptions
Domains	Optimization, Data_Engineering
Last Updated	2026-02-08 00:00 GMT

Overview

Use zstd compression at level 1 (default) for balanced read/write speed; increase to level 9 only when storage cost matters more than throughput.

Description

PyPaimon defaults to zstd compression at level 1 for all data files, which balances compression ratio against throughput. The level can be raised for a higher compression ratio (the option description suggests 9), but that same description warns that read and write speed will "significantly decrease". In addition, the `file.compression.per.level` option assigns a different compression type to each LSM level, enabling a tiered approach in which frequently accessed levels use a fast codec and cold levels use a codec with a higher compression ratio.
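
As a minimal sketch of the two profiles described above, the snippet below builds plain option dictionaries for a "hot" table and an "archive" table. The option keys are the ones quoted from `core_options.py`; the helper name `compression_options` and the choice to express values as strings are assumptions made for illustration, not part of PyPaimon.

```python
from typing import Dict

# Option keys as quoted from core_options.py.
FILE_COMPRESSION = "file.compression"
ZSTD_LEVEL = "file.compression.zstd-level"


def compression_options(storage_optimized: bool = False) -> Dict[str, str]:
    """Hypothetical helper: return table options for a fast or compact profile."""
    # Level 1 mirrors the documented default; 9 is the value the option
    # description suggests when compression ratio matters more than speed.
    level = 9 if storage_optimized else 1
    # Assumption: option values are passed as strings.
    return {FILE_COMPRESSION: "zstd", ZSTD_LEVEL: str(level)}


hot_table_options = compression_options()
archive_table_options = compression_options(storage_optimized=True)
print(hot_table_options)
print(archive_table_options)
```

These dictionaries would typically be supplied as table options when a table is created; the exact entry point depends on the PyPaimon version in use.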

Usage

Apply this heuristic when configuring table write performance or storage cost optimization. Relevant for all workflows that write data files: Table_Read_Write, Data_Ingestion_With_Ray_Sink, Lance_Format_Analytics, and Blob_Storage_With_Descriptors.

The Insight (Rule of Thumb)

Action: Keep `file.compression` at `zstd` (default). Only change `file.compression.zstd-level` if needed.
Value: Level 1 (default) for speed. Level 9 for a higher compression ratio.
Trade-off: Level 9 provides higher compression ratio but "significantly decreases" read and write speed per the source code documentation.
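Advanced: Use `file.compression.per.level` to assign a different compression type per LSM tree level (e.g., a faster codec for level 0 and `zstd` for level 5). Per the option description, each entry maps a level to a compression type, not to a zstd level.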
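
The per-level rule can be sketched as a small encoder for the option value. This assumes the `level:codec` comma-separated form used in Paimon's Java configuration examples (for instance `0:lz4,1:zstd`); whether PyPaimon accepts the same string form, and which codecs it supports besides zstd, is not confirmed by the source. The helper `encode_per_level` is hypothetical.

```python
from typing import Dict


def encode_per_level(codecs: Dict[int, str]) -> str:
    """Hypothetical helper: encode {lsm_level: codec} as 'level:codec,...'."""
    for level, codec in codecs.items():
        if level < 0:
            raise ValueError(f"LSM level must be non-negative, got {level}")
        if not codec:
            raise ValueError(f"missing compression type for level {level}")
    # Sort by level so the encoded value is stable and easy to read.
    return ",".join(f"{level}:{codec}" for level, codec in sorted(codecs.items()))


# Assumption: lz4 is an available codec for the fast, frequently rewritten level.
per_level = encode_per_level({5: "zstd", 0: "lz4"})
table_options = {
    "file.compression": "zstd",  # fallback for levels not listed below
    "file.compression.per.level": per_level,
}
print(table_options)
```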

Reasoning

The zstd codec at level 1 is a reasonable default because zstd at low levels generally approaches lz4 speeds while achieving noticeably better compression ratios. The Paimon project explicitly documents the speed-vs-ratio trade-off in the configuration description. For hot data (frequently read), level 1 minimizes latency. For archival/cold data, higher levels reduce storage costs at the expense of throughput. The per-level compression option acknowledges that LSM-tree based storage has different access patterns at different levels.
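
The speed-vs-ratio trade-off can be observed with a rough measurement. The sketch below uses Python's standard-library `zlib` as a stand-in, because zstd is not guaranteed to be importable everywhere; zlib's levels are not equivalent to zstd's, and the synthetic payload is an assumption, so the numbers only illustrate the shape of the trade-off. Measure with real table data before changing `file.compression.zstd-level`.

```python
import time
import zlib

# Synthetic, repetitive CSV-like payload; real tables will behave differently.
payload = "\n".join(
    f"{i},user_{i % 100},{i * 7 % 1000}" for i in range(50_000)
).encode()


def measure(level: int) -> dict:
    """Compress the payload at the given zlib level and time the call."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: the level never changes the decoded bytes.
    assert zlib.decompress(compressed) == payload
    return {"level": level, "bytes": len(compressed), "seconds": elapsed}


results = [measure(level) for level in (1, 9)]
for r in results:
    ratio = len(payload) / r["bytes"]
    print(f"level={r['level']} ratio={ratio:.2f} seconds={r['seconds']:.4f}")
```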

The Python SDK defaults `metadata.stats-mode` to `none` (unlike Java's `truncate(16)`), further optimizing for write throughput over metadata richness.

Code Evidence

Compression defaults from `pypaimon/common/options/core_options.py:125-140`:

FILE_COMPRESSION: ConfigOption[str] = (
    ConfigOptions.key("file.compression")
    .string_type()
    .default_value("zstd")
    .with_description("Default file compression format. For faster read and write, "
                      "it is recommended to use zstd.")
)

FILE_COMPRESSION_ZSTD_LEVEL: ConfigOption[int] = (
    ConfigOptions.key("file.compression.zstd-level")
    .int_type()
    .default_value(1)
    .with_description(
        "Default file compression zstd level. For higher compression rates, "
        "it can be configured to 9, but the read and write speed will "
        "significantly decrease."
    )
)

Per-level compression from `pypaimon/common/options/core_options.py:142-150`:

FILE_COMPRESSION_PER_LEVEL: ConfigOption[Dict[str, str]] = (
    ConfigOptions.key("file.compression.per.level")
    .map_type()
    .default_value({})
    .with_description(
        "Define different compression policies for different level LSM data files, "
        "you can add the level and the corresponding compression type."
    )
)

Python vs Java stats mode from `pypaimon/common/options/core_options.py:169-174`:

METADATA_STATS_MODE: ConfigOption[str] = (
    ConfigOptions.key("metadata.stats-mode")
    .string_type()
    .default_value("none")
    .with_description("Stats Mode, Python by default is none. Java is truncate(16).")
)
