
Environment: NautechSystems NautilusTrader Arrow Parquet Serialization



Knowledge Sources
Domains: Infrastructure, Data_Persistence
Last Updated: 2026-02-10 08:30 GMT

Overview

An Apache Arrow (PyArrow >= 22.0.0) and fsspec (>= 2025.2.0) environment providing columnar data serialization, Parquet catalog storage, and cloud filesystem access.

Description

NautilusTrader uses Apache Arrow as its primary serialization format for market data persistence. The `ParquetDataCatalog` stores trade ticks, quote ticks, bars, and other data types in Parquet files using PyArrow. The `fsspec` library provides an abstract filesystem layer that allows the catalog to read/write from local disk, S3, GCS, or any fsspec-compatible storage backend. DataFusion is used internally for efficient SQL-based querying of Parquet files.
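A minimal sketch of writing to and reading back from a local catalog. The path and the test-kit instrument provider are illustrative assumptions chosen to keep the example self-contained; `ParquetDataCatalog`, `write_data()`, and `instruments()` are the catalog entry points this page describes.

# Minimal sketch: create a local catalog, write an instrument definition,
# then query it back. The path and test-kit provider are illustrative only.
from nautilus_trader.persistence.catalog import ParquetDataCatalog
from nautilus_trader.test_kit.providers import TestInstrumentProvider

catalog = ParquetDataCatalog("/tmp/nautilus/catalog")  # local filesystem backend

instrument = TestInstrumentProvider.default_fx_ccy("EUR/USD")
catalog.write_data([instrument])   # serialized to Parquet via PyArrow

print(catalog.instruments())       # read back through fsspec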

Usage

This environment is required whenever you use the `ParquetDataCatalog` for data storage and retrieval. This includes writing historical data to catalogs, querying data for backtesting via `BacktestDataConfig`, and any operation that reads or writes `.parquet` files. It is also required by the `BacktestNode`, which loads data from the catalog via its configuration.
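A hedged sketch of wiring catalog data into a backtest through `BacktestDataConfig`. The catalog path, instrument ID, and date range are placeholder values, and the parameter names follow current documentation rather than a guaranteed signature.

# Illustrative sketch: point a backtest at catalog data via BacktestDataConfig.
from nautilus_trader.config import BacktestDataConfig
from nautilus_trader.model.data import QuoteTick
from nautilus_trader.model.identifiers import InstrumentId

data_config = BacktestDataConfig(
    catalog_path="/tmp/nautilus/catalog",             # path the catalog was written to
    data_cls=QuoteTick,                                # data type to load from Parquet
    instrument_id=InstrumentId.from_str("EUR/USD.SIM"),  # assumed instrument in the catalog
    start_time="2024-01-01",
    end_time="2024-01-31",
)
# A BacktestNode resolves this config and streams the matching Parquet data.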

System Requirements

Category | Requirement | Notes
OS | Linux, macOS, or Windows | All platforms supported
Disk | Varies by dataset size | SSD recommended for large catalogs; Parquet is columnar and compact
Memory | Sufficient for dataset loading | Use `max_rows_per_group` to control memory during writes

Dependencies

Python Packages

  • `pyarrow` >= 22.0.0
  • `fsspec` >= 2025.2.0, <= 2026.1.0
  • `pandas` >= 2.3.3, < 3.0.0 (for DataFrame integration)
  • `numpy` >= 1.26.4
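
A quick way to confirm the installed versions satisfy these pins:

# Sanity check that installed versions meet the pins above.
import fsspec
import numpy
import pandas
import pyarrow

print("pyarrow:", pyarrow.__version__)
print("fsspec:", fsspec.__version__)
print("pandas:", pandas.__version__)
print("numpy:", numpy.__version__)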

Credentials

The following environment variables may be required:

  • `NAUTILUS_PATH`: Base path for catalog storage. Required when using `ParquetDataCatalog.from_env()`. The catalog is created at `$NAUTILUS_PATH/catalog`.
  • Cloud storage credentials (e.g., `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) when using fsspec with S3 or other cloud backends.
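
A sketch of both construction paths. Setting the variable in-process here is only for illustration; normally `NAUTILUS_PATH` is exported in the shell, and the chosen paths are placeholders.

# Sketch of the two construction paths for ParquetDataCatalog.
import os

from nautilus_trader.persistence.catalog import ParquetDataCatalog

os.environ.setdefault("NAUTILUS_PATH", "/tmp/nautilus")  # illustration only

catalog_from_env = ParquetDataCatalog.from_env()   # resolves $NAUTILUS_PATH/catalog
catalog_from_uri = ParquetDataCatalog.from_uri("/tmp/nautilus/catalog")  # explicit path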

Quick Install

# Core dependencies (included with nautilus_trader)
pip install "pyarrow>=22.0.0" "fsspec>=2025.2.0" "pandas>=2.3.3"

# For S3 catalog access
pip install s3fs

# For GCS catalog access
pip install gcsfs

Code Evidence

NAUTILUS_PATH environment variable check from `persistence/catalog/parquet.py:182-184`:

if _NAUTILUS_PATH not in os.environ:
    raise OSError(f"'{_NAUTILUS_PATH}' environment variable is not set.")

return cls.from_uri(os.environ[_NAUTILUS_PATH] + "/catalog")

max_rows_per_group default from `persistence/catalog/parquet.py:111-114`:

max_rows_per_group : int, default 5000
    The maximum number of rows per group. If the value is greater than 0,
    then the dataset writer may split up large incoming batches into
    multiple row groups.
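
A hedged sketch of lowering the row-group cap to bound write-time memory, assuming `max_rows_per_group` is accepted by the `ParquetDataCatalog` constructor as the docstring excerpt above indicates.

# Assumed constructor parameter: smaller row groups -> smaller write buffers.
from nautilus_trader.persistence.catalog import ParquetDataCatalog

catalog = ParquetDataCatalog(
    "/tmp/nautilus/catalog",
    max_rows_per_group=1_000,
)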

Common Errors

Error Message | Cause | Solution
`OSError: 'NAUTILUS_PATH' environment variable is not set` | Using `from_env()` without setting the variable | Set `export NAUTILUS_PATH=/path/to/data` or use `from_uri()` instead
`ImportError: No module named 'pyarrow'` | PyArrow not installed | `pip install "pyarrow>=22.0.0"`
`FileNotFoundError` on catalog path | Catalog directory does not exist | Ensure the path exists or let `ParquetDataCatalog` create it via `from_uri()`

Compatibility Notes

  • fsspec backends: The catalog supports any fsspec-compatible filesystem (local, S3, GCS, Azure Blob). Install the appropriate fsspec implementation package (e.g., `s3fs`, `gcsfs`).
  • DataFusion: Used internally for SQL queries on Parquet data. The `optimize_file_loading` parameter controls whether entire directories are registered (more efficient for many files) or individual files (needed for consolidation operations).
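
A sketch of a cloud-backed catalog over fsspec/s3fs. The bucket is a placeholder, and `fs_protocol`/`fs_storage_options` are assumed constructor parameters; credentials may also come from the standard AWS environment variables rather than storage options.

# Assumed S3-backed catalog configuration via the s3fs fsspec backend.
import os

from nautilus_trader.persistence.catalog import ParquetDataCatalog

catalog = ParquetDataCatalog(
    "my-bucket/nautilus/catalog",        # hypothetical bucket/prefix
    fs_protocol="s3",                    # selects the s3fs backend
    fs_storage_options={
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),        # optional; s3fs can also
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),  # read the env itself
    },
)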
