
Heuristic: NautilusTrader Parquet Row Group Tuning

From Leeroopedia

Knowledge Sources
Domains Optimization, Data_Persistence
Last Updated 2026-02-10 08:30 GMT

Overview

Tune Parquet write performance via the `max_rows_per_group` parameter, which controls memory usage during writes and read performance when writing to the data catalog.

Description

The `ParquetDataCatalog` uses the `max_rows_per_group` parameter to control how large incoming data batches are split into Parquet row groups during writes. Row groups are the fundamental unit of parallelism in Parquet; each row group can be read independently. Smaller row groups mean lower peak memory during writes but potentially more overhead during reads. Larger row groups improve read compression and scan performance but require more memory during writes.

Usage

Use this heuristic when:

  • Writing large volumes of tick data to the catalog and encountering memory pressure during writes
  • Optimizing catalog read performance for backtests that scan large time ranges
  • Consolidating existing catalog data files

The Insight (Rule of Thumb)

  • Action: Set `max_rows_per_group` when initializing `ParquetDataCatalog`.
  • Value: Default is 5,000 rows per group. This is a balanced default for most use cases.
  • Tuning guidance:
    • Lower (1,000-2,000): Use when writing very wide records or running on memory-constrained systems.
    • Higher (10,000-50,000): Use when writing narrow records (e.g., trade ticks) and read performance is the priority.
  • Trade-off: Smaller row groups reduce write memory but increase read overhead. Larger row groups improve read compression but require more write memory.
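On the write side, splitting an incoming batch under a row-group cap amounts to plain chunking. The function below is an illustrative sketch of that behavior, not the library's implementation:

```python
def split_into_row_groups(n_rows: int, max_rows_per_group: int) -> list[int]:
    """Return the sizes of the row groups an n_rows batch is split into,
    assuming the writer fills each group up to the cap. Illustrative only."""
    if max_rows_per_group <= 0:
        # Per the docstring, splitting only applies when the cap is > 0.
        return [n_rows]
    full, rem = divmod(n_rows, max_rows_per_group)
    return [max_rows_per_group] * full + ([rem] if rem else [])

# A 12,500-row batch under the default cap of 5,000 yields
# two full groups plus a 2,500-row remainder.
print(split_into_row_groups(12_500, 5_000))  # -> [5000, 5000, 2500]
```

Peak write memory scales with the largest group, so lowering the cap bounds the size of any single buffered chunk at the cost of producing more groups.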

Reasoning

Parquet row groups determine the granularity of I/O operations. Each row group has its own column chunk metadata and statistics, which DataFusion uses for predicate pushdown during queries. With very small row groups, the metadata overhead grows relative to actual data. With very large row groups, the engine must read more data than needed for narrow time-range queries. The default of 5,000 rows provides a good balance for typical trading data (trade ticks, bars).
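To make the metadata-overhead point concrete, here is a back-of-the-envelope estimate. The per-group metadata size (~2 KB) and per-row payload (~40 bytes) are rough assumptions for narrow tick data, not measured values:

```python
def metadata_overhead_fraction(rows_per_group: int,
                               bytes_per_row: float = 40.0,
                               metadata_bytes_per_group: float = 2_000.0) -> float:
    """Rough fraction of file bytes spent on per-row-group metadata.
    Both constants are illustrative assumptions, not measurements."""
    data_bytes = rows_per_group * bytes_per_row
    return metadata_bytes_per_group / (data_bytes + metadata_bytes_per_group)

# Overhead shrinks quickly as the row-group size grows.
for n in (100, 1_000, 5_000, 50_000):
    print(n, round(metadata_overhead_fraction(n), 4))
```

Under these assumptions a 100-row group spends roughly a third of its bytes on metadata, while at the 5,000-row default the overhead drops below one percent, consistent with 5,000 being a balanced choice.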

Additionally, the `optimize_file_loading` parameter (default `False`) controls whether DataFusion registers entire directories (more efficient for many files) or individual files (needed for precise control during consolidation operations).

Code Evidence

Default parameter from `persistence/catalog/parquet.py:111-114`:

max_rows_per_group : int, default 5000
    The maximum number of rows per group. If the value is greater than 0,
    then the dataset writer may split up large incoming batches into
    multiple row groups.

Usage in write operation from `persistence/catalog/parquet.py:360`:

row_group_size=self.max_rows_per_group,

optimize_file_loading documentation from `persistence/catalog/parquet.py:1723-1727`:

optimize_file_loading : bool, default False
    If True, registers entire directories with DataFusion, which is
    more efficient for managing many files. If False, registers each file
    individually (needed for operations like consolidation where precise file
    control is required).
