Heuristic: NautilusTrader (nautechsystems/nautilus_trader) Parquet Row Group Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Data_Persistence |
| Last Updated | 2026-02-10 08:30 GMT |
Overview
Tuning Parquet writes via the `max_rows_per_group` parameter, which trades off memory usage during writes against read performance when persisting to the data catalog.
Description
The `ParquetDataCatalog` uses the `max_rows_per_group` parameter to control how large incoming data batches are split into Parquet row groups during writes: any batch exceeding the limit is divided into multiple groups. Row groups are the fundamental unit of parallelism in Parquet; each row group can be read independently. Smaller row groups mean lower peak memory during writes but potentially more overhead during reads. Larger row groups improve read compression and scan performance but require more memory during writes.
Usage
Use this heuristic when:
- Writing large volumes of tick data to the catalog and encountering memory pressure during writes
- Optimizing catalog read performance for backtests that scan large time ranges
- Consolidating existing catalog data files
The Insight (Rule of Thumb)
- Action: Set `max_rows_per_group` when initializing `ParquetDataCatalog`.
- Value: The default is 5,000 rows per group, a balanced setting for most use cases.
- Tuning guidance:
- Lower (1,000-2,000): Use when writing very wide records or running on memory-constrained systems.
- Higher (10,000-50,000): Use when writing narrow records (e.g., trade ticks) and read performance is the priority.
- Trade-off: Smaller row groups reduce write memory but increase read overhead. Larger row groups improve read compression but require more write memory.
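The trade-off above can be made concrete with a back-of-the-envelope calculation. This is a self-contained sketch; the 40-bytes-per-row figure is an illustrative assumption for a narrow trade-tick record, not a measured value:

```python
import math


def row_group_plan(total_rows: int, max_rows_per_group: int, bytes_per_row: int):
    """Estimate row-group count and the in-memory size of one full group."""
    groups = math.ceil(total_rows / max_rows_per_group)
    group_bytes = min(total_rows, max_rows_per_group) * bytes_per_row
    return groups, group_bytes


# 10 million ticks at an assumed ~40 bytes per row:
print(row_group_plan(10_000_000, 5_000, 40))   # (2000, 200000)   default
print(row_group_plan(10_000_000, 50_000, 40))  # (200, 2000000)   read-optimized
```

Raising the limit tenfold cuts per-file metadata entries tenfold while multiplying the size of the buffered group by the same factor, which is exactly the write-memory versus read-overhead trade-off stated above.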
Reasoning
Parquet row groups determine the granularity of I/O operations. Each row group has its own column chunk metadata and statistics, which DataFusion uses for predicate pushdown during queries. With very small row groups, the metadata overhead grows relative to actual data. With very large row groups, the engine must read more data than needed for narrow time-range queries. The default of 5,000 rows provides a good balance for typical trading data (trade ticks, bars).
Additionally, the `optimize_file_loading` parameter (default `False`) controls whether DataFusion registers entire directories (more efficient for many files) or individual files (needed for precise control during consolidation operations).
Code Evidence
Default parameter from `persistence/catalog/parquet.py:111-114`:

```
max_rows_per_group : int, default 5000
    The maximum number of rows per group. If the value is greater than 0,
    then the dataset writer may split up large incoming batches into
    multiple row groups.
```
Usage in write operation from `persistence/catalog/parquet.py:360`:

```
row_group_size=self.max_rows_per_group,
```
`optimize_file_loading` documentation from `persistence/catalog/parquet.py:1723-1727`:

```
optimize_file_loading : bool, default False
    If True, registers entire directories with DataFusion, which is
    more efficient for managing many files. If False, registers each file
    individually (needed for operations like consolidation where precise file
    control is required).
```