Heuristic:Apache Paimon File Sizing and Split Planning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Target 128 MB data files for primary key tables and 256 MB for append-only tables; use a 128 MB split target with a 4 MB file-open cost to balance parallelism against per-file overhead.
Description
PyPaimon uses several interrelated size parameters to control data file creation and read split planning. The target file size determines when a new data file is started during writes. The split target size and file-open cost control how data files are grouped into read splits for parallel scanning. The batch size (1024 records) controls how many rows are processed per iteration. The manifest target size (8 MB) and merge threshold (30 files) control manifest compaction. These defaults represent the Paimon project's empirical tuning for balanced throughput.
Usage
Apply this heuristic when tuning write performance (file sizes), read parallelism (split sizes), or commit overhead (manifest sizes). Relevant for all data access workflows but especially important for large-scale distributed processing with Ray.
The Insight (Rule of Thumb)
- Target File Size:
  - Primary key tables: 128 MB (smaller for better compaction/merge performance)
  - Append-only tables: 256 MB (larger since no merging needed)
  - Blob files: 256 MB (large binary data benefits from larger files)
- Split Planning:
  - Target split size: 128 MB (balance between parallelism and task overhead)
  - File-open cost: 4 MB (penalizes splits with many small files to avoid excessive file opens)
- Read Batching:
  - Batch size: 1024 records (balance between memory usage and CPU efficiency)
- Manifest Management:
  - Manifest target size: 8 MB
  - Merge threshold: 30 manifest files (trigger compaction when exceeded)
- Commit Retries:
  - Max retries: 10; min wait: 10 ms; max wait: 10 s (exponential backoff)
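These knobs are exposed as string-keyed table options. A minimal sketch of overriding the defaults for an append-only table; the `source.split.*` and `read.batch-size` keys match the code evidence below, while the `target-file-size` key and the "`256 mb`"-style value format are assumptions based on Paimon conventions:

```python
# Hedged sketch: per-table option overrides for Paimon sizing defaults.
# 'target-file-size' key and the human-readable size strings are assumptions;
# the other keys appear in pypaimon/common/options/core_options.py.
table_options = {
    'target-file-size': '256 mb',           # append-only table: larger files
    'source.split.target-size': '128 mb',   # keep the default split granularity
    'source.split.open-file-cost': '4 mb',  # penalize splits with many small files
    'read.batch-size': '1024',              # rows per read batch
}
```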
Reasoning
The 128 MB vs 256 MB distinction for primary key tables reflects the compaction overhead tradeoff: primary key tables undergo LSM-tree style merge operations, so smaller files reduce the per-merge I/O cost. Append-only tables never compact, so larger files reduce metadata overhead.
The 4 MB file-open cost is an important heuristic for split planning. Without this cost, the planner might create a split containing hundreds of tiny files, leading to excessive file handle usage and metadata overhead. By treating each file as if it contributes 4 MB of "virtual" size, the planner naturally groups small files into fewer, larger splits.
The 1024-record batch size is standard for columnar readers (Arrow, Parquet), providing enough rows for vectorized operations without excessive memory per batch.
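The row-batching behavior can be mimicked with a plain generator (a sketch of the iteration pattern, not the Arrow reader itself):

```python
from itertools import islice

def iter_batches(rows, batch_size=1024):
    """Yield lists of up to `batch_size` rows; mirrors the 1024-row default."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch

# 2500 rows split into full 1024-row batches plus a final remainder.
sizes = [len(b) for b in iter_batches(range(2500))]
print(sizes)  # → [1024, 1024, 452]
```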
The manifest merge threshold of 30 files prevents unbounded manifest list growth while avoiding excessive compaction overhead.
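A toy simulation makes the bounded-growth argument concrete. The 30-file threshold matches the source; the assumption here is that a merge folds the whole list into a single manifest, which is a simplification of the real compaction:

```python
def simulate_commits(n_commits, merge_min_count=30):
    """Toy simulation: each commit appends one manifest; once the list
    reaches the threshold, it is folded into a single merged manifest.
    (The fold-to-one step is a simplifying assumption; the threshold
    matches file_store_commit.py.)"""
    manifests = []
    merges = 0
    for _ in range(n_commits):
        manifests.append(1)  # new manifest written by this commit
        if len(manifests) >= merge_min_count:
            manifests = [sum(manifests)]  # compact the manifest list
            merges += 1
    return len(manifests), merges

# 100 commits trigger only 3 merges and leave a short manifest list,
# instead of a 100-entry list with no compaction.
print(simulate_commits(100))  # → (13, 3)
```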
Code Evidence
Target file size logic from `pypaimon/common/options/core_options.py:481-484`:
def target_file_size(self, has_primary_key, default=None):
    return self.options.get(
        CoreOptions.TARGET_FILE_SIZE,
        MemorySize.of_mebi_bytes(128 if has_primary_key else 256)
        if default is None else default)
Split target and file-open cost from `pypaimon/common/options/core_options.py:217-232`:
SOURCE_SPLIT_TARGET_SIZE: ConfigOption[MemorySize] = (
    ConfigOptions.key("source.split.target-size")
    .memory_type()
    .default_value(MemorySize.of_mebi_bytes(128))
    .with_description("The target size of a source split when scanning a table.")
)

SOURCE_SPLIT_OPEN_FILE_COST: ConfigOption[MemorySize] = (
    ConfigOptions.key("source.split.open-file-cost")
    .memory_type()
    .default_value(MemorySize.of_mebi_bytes(4))
    .with_description(
        "The estimated cost to open a file, used when scanning a table. "
        "It is used to avoid opening too many small files."
    )
)
Manifest sizing from `pypaimon/write/file_store_commit.py:87-88`:
self.manifest_target_size = 8 * 1024 * 1024 # 8 MB
self.manifest_merge_min_count = 30
Batch size default from `pypaimon/common/options/core_options.py:414-419`:
READ_BATCH_SIZE: ConfigOption[int] = (
    ConfigOptions.key("read.batch-size")
    .int_type()
    .default_value(1024)
    .with_description("Read batch size for any file format if it supports.")
)
Commit retry configuration from `pypaimon/common/options/core_options.py:255-281`:
COMMIT_MAX_RETRIES: ConfigOption[int] = (
    ConfigOptions.key("commit.max-retries")
    .int_type()
    .default_value(10)
)

COMMIT_MIN_RETRY_WAIT: ConfigOption[timedelta] = (
    ConfigOptions.key("commit.min-retry-wait")
    .duration_type()
    .default_value(timedelta(milliseconds=10))
)

COMMIT_MAX_RETRY_WAIT: ConfigOption[timedelta] = (
    ConfigOptions.key("commit.max-retry-wait")
    .duration_type()
    .default_value(timedelta(seconds=10))
)
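Given these bounds, the wait schedule can be sketched as capped exponential backoff. Doubling from the minimum wait is an assumption about the backoff factor; the 10 ms floor, 10 s cap, and 10-retry limit match the defaults above:

```python
def backoff_waits(max_retries=10, min_wait_ms=10, max_wait_ms=10_000):
    """Capped exponential backoff: double the wait each retry, never
    exceeding the cap. (Doubling factor is an assumption; the bounds
    match the configured defaults.)"""
    return [min(min_wait_ms * 2 ** attempt, max_wait_ms)
            for attempt in range(max_retries)]

print(backoff_waits())
# → [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120]
```

Note that with the default 10 retries the 10 s cap is never reached; it only bites if the retry limit is raised.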
Ray greedy knapsack split distribution from `pypaimon/read/datasource/ray_datasource.py:76-93`:
@staticmethod
def _distribute_splits_into_equal_chunks(
        splits: Iterable[Split], n_chunks: int
) -> List[List[Split]]:
    """
    Implement a greedy knapsack algorithm to distribute the splits
    across tasks, based on their file size, as evenly as possible.
    """
    chunks = [list() for _ in range(n_chunks)]
    chunk_sizes = [(0, chunk_id) for chunk_id in range(n_chunks)]
    heapq.heapify(chunk_sizes)
    for split in sorted(
            splits,
            key=lambda s: s.file_size
            if hasattr(s, 'file_size') and s.file_size > 0 else 0,
            reverse=True):
        ...
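The elided loop body follows the classic longest-processing-time pattern: place each split, largest first, into the currently lightest chunk. A self-contained sketch of that idea over plain integers (an illustration, not the PyPaimon source):

```python
import heapq
from typing import Iterable, List

def distribute_into_chunks(sizes: Iterable[int], n_chunks: int) -> List[List[int]]:
    """Greedy LPT assignment: pop the lightest chunk from a min-heap,
    give it the next-largest item, and push it back with its new total.
    Illustrative sketch of the elided loop, not the PyPaimon source."""
    chunks: List[List[int]] = [[] for _ in range(n_chunks)]
    heap = [(0, chunk_id) for chunk_id in range(n_chunks)]
    heapq.heapify(heap)
    for size in sorted(sizes, reverse=True):
        total, chunk_id = heapq.heappop(heap)  # lightest chunk so far
        chunks[chunk_id].append(size)
        heapq.heappush(heap, (total + size, chunk_id))
    return chunks

chunks = distribute_into_chunks([9, 7, 6, 5, 4, 3], 2)
print(sorted(sum(c) for c in chunks))  # → [17, 17]
```

Sorting descending first matters: placing large items early leaves the small ones to smooth out any imbalance at the end.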
Related Pages
- Implementation:Apache_Paimon_BatchTableWrite_Write_Arrow
- Implementation:Apache_Paimon_BatchTableWrite_Write_Pandas
- Implementation:Apache_Paimon_ReadBuilder_Scan
- Implementation:Apache_Paimon_TableRead_To_Arrow
- Implementation:Apache_Paimon_TableCommit_Commit
- Implementation:Apache_Paimon_TableRead_To_Ray
- Implementation:Apache_Paimon_BatchTableWrite_Write_Ray