
Environment: NautechSystems NautilusTrader Arrow Parquet Serialization



Knowledge Sources
Domains: Infrastructure, Data_Persistence
Last Updated: 2026-02-10 08:30 GMT

Overview

An Apache Arrow (PyArrow >= 22.0.0) and fsspec (>= 2025.2.0) environment providing columnar data serialization, Parquet catalog storage, and cloud filesystem access.

Description

NautilusTrader uses Apache Arrow as its primary serialization format for market data persistence. The `ParquetDataCatalog` stores trade ticks, quote ticks, bars, and other data types in Parquet files using PyArrow. The `fsspec` library provides an abstract filesystem layer that allows the catalog to read/write from local disk, S3, GCS, or any fsspec-compatible storage backend. DataFusion is used internally for efficient SQL-based querying of Parquet files.
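A minimal sketch of writing to and reading back from a local catalog. The path and the test-kit instrument provider are illustrative assumptions chosen to keep the example self-contained; `ParquetDataCatalog`, `write_data()`, and `instruments()` are the catalog entry points this page describes.

# Minimal sketch: create a local catalog, write an instrument definition,
# then query it back. The path and test-kit provider are illustrative only.
from nautilus_trader.persistence.catalog import ParquetDataCatalog
from nautilus_trader.test_kit.providers import TestInstrumentProvider

catalog = ParquetDataCatalog("/tmp/nautilus/catalog")  # local filesystem backend

instrument = TestInstrumentProvider.default_fx_ccy("EUR/USD")
catalog.write_data([instrument])   # serialized to Parquet via PyArrow

print(catalog.instruments())       # read back through fsspec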

Usage

This environment is required whenever you use the `ParquetDataCatalog` for data storage and retrieval. This includes writing historical data to catalogs, querying data for backtesting via `BacktestDataConfig`, and any operation that reads or writes `.parquet` files. It is also required by the `BacktestNode`, which loads data from the catalog via its configuration.
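A hedged sketch of wiring catalog data into a backtest through `BacktestDataConfig`. The catalog path, instrument ID, and date range are placeholder values, and the parameter names follow current documentation rather than a guaranteed signature.

# Illustrative sketch: point a backtest at catalog data via BacktestDataConfig.
from nautilus_trader.config import BacktestDataConfig
from nautilus_trader.model.data import QuoteTick
from nautilus_trader.model.identifiers import InstrumentId

data_config = BacktestDataConfig(
    catalog_path="/tmp/nautilus/catalog",             # path the catalog was written to
    data_cls=QuoteTick,                                # data type to load from Parquet
    instrument_id=InstrumentId.from_str("EUR/USD.SIM"),  # assumed instrument in the catalog
    start_time="2024-01-01",
    end_time="2024-01-31",
)
# A BacktestNode resolves this config and streams the matching Parquet data.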

System Requirements

Category | Requirement | Notes
OS | Linux, macOS, or Windows | All platforms supported
Disk | Varies by dataset size | SSD recommended for large catalogs; Parquet is columnar and compact
Memory | Sufficient for dataset loading | Use `max_rows_per_group` to control memory during writes

Dependencies

Python Packages

  • `pyarrow` >= 22.0.0
  • `fsspec` >= 2025.2.0, <= 2026.1.0
  • `pandas` >= 2.3.3, < 3.0.0 (for DataFrame integration)
  • `numpy` >= 1.26.4
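
A quick way to confirm the installed versions satisfy these pins:

# Sanity check that installed versions meet the pins above.
import fsspec
import numpy
import pandas
import pyarrow

print("pyarrow:", pyarrow.__version__)
print("fsspec:", fsspec.__version__)
print("pandas:", pandas.__version__)
print("numpy:", numpy.__version__)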

Credentials

The following environment variables may be required:

  • `NAUTILUS_PATH`: Base path for catalog storage. Required when using `ParquetDataCatalog.from_env()`. The catalog is created at `$NAUTILUS_PATH/catalog`.
  • Cloud storage credentials (e.g., `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) when using fsspec with S3 or other cloud backends.
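
A sketch of both construction paths. Setting the variable in-process here is only for illustration; normally `NAUTILUS_PATH` is exported in the shell, and the chosen paths are placeholders.

# Sketch of the two construction paths for ParquetDataCatalog.
import os

from nautilus_trader.persistence.catalog import ParquetDataCatalog

os.environ.setdefault("NAUTILUS_PATH", "/tmp/nautilus")  # illustration only

catalog_from_env = ParquetDataCatalog.from_env()   # resolves $NAUTILUS_PATH/catalog
catalog_from_uri = ParquetDataCatalog.from_uri("/tmp/nautilus/catalog")  # explicit path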

Quick Install

# Core dependencies (included with nautilus_trader)
pip install "pyarrow>=22.0.0" "fsspec>=2025.2.0" "pandas>=2.3.3"

# For S3 catalog access
pip install s3fs

# For GCS catalog access
pip install gcsfs

Code Evidence

NAUTILUS_PATH environment variable check from `persistence/catalog/parquet.py:182-184`:

if _NAUTILUS_PATH not in os.environ:
    raise OSError(f"'{_NAUTILUS_PATH}' environment variable is not set.")

return cls.from_uri(os.environ[_NAUTILUS_PATH] + "/catalog")

max_rows_per_group default from `persistence/catalog/parquet.py:111-114`:

max_rows_per_group : int, default 5000
    The maximum number of rows per group. If the value is greater than 0,
    then the dataset writer may split up large incoming batches into
    multiple row groups.
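
A hedged sketch of lowering the row-group cap to bound write-time memory, assuming `max_rows_per_group` is accepted by the `ParquetDataCatalog` constructor as the docstring excerpt above indicates.

# Assumed constructor parameter: smaller row groups -> smaller write buffers.
from nautilus_trader.persistence.catalog import ParquetDataCatalog

catalog = ParquetDataCatalog(
    "/tmp/nautilus/catalog",
    max_rows_per_group=1_000,
)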

Common Errors

Error Message | Cause | Solution
`OSError: 'NAUTILUS_PATH' environment variable is not set` | Using `from_env()` without setting the variable | Set `export NAUTILUS_PATH=/path/to/data` or use `from_uri()` instead
`ImportError: No module named 'pyarrow'` | PyArrow not installed | `pip install "pyarrow>=22.0.0"`
`FileNotFoundError` on catalog path | Catalog directory does not exist | Ensure the path exists or let `ParquetDataCatalog` create it via `from_uri()`

Compatibility Notes

  • fsspec backends: The catalog supports any fsspec-compatible filesystem (local, S3, GCS, Azure Blob). Install the appropriate fsspec implementation package (e.g., `s3fs`, `gcsfs`).
  • DataFusion: Used internally for SQL queries on Parquet data. The `optimize_file_loading` parameter controls whether entire directories are registered (more efficient for many files) or individual files (needed for consolidation operations).
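
A sketch of a cloud-backed catalog over fsspec/s3fs. The bucket is a placeholder, and `fs_protocol`/`fs_storage_options` are assumed constructor parameters; credentials may also come from the standard AWS environment variables rather than storage options.

# Assumed S3-backed catalog configuration via the s3fs fsspec backend.
import os

from nautilus_trader.persistence.catalog import ParquetDataCatalog

catalog = ParquetDataCatalog(
    "my-bucket/nautilus/catalog",        # hypothetical bucket/prefix
    fs_protocol="s3",                    # selects the s3fs backend
    fs_storage_options={
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),        # optional; s3fs can also
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),  # read the env itself
    },
)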
