Environment: NautechSystems NautilusTrader Arrow Parquet Serialization
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Data_Persistence |
| Last Updated | 2026-02-10 08:30 GMT |
Overview
Apache Arrow (PyArrow >= 22.0.0) and fsspec (>= 2025.2.0) environment for columnar data serialization, Parquet catalog storage, and cloud filesystem access.
Description
NautilusTrader uses Apache Arrow as its primary serialization format for market data persistence. The `ParquetDataCatalog` stores trade ticks, quote ticks, bars, and other data types in Parquet files using PyArrow. The `fsspec` library provides an abstract filesystem layer that allows the catalog to read/write from local disk, S3, GCS, or any fsspec-compatible storage backend. DataFusion is used internally for efficient SQL-based querying of Parquet files.
Usage
This environment is required whenever the `ParquetDataCatalog` is used for data storage and retrieval: writing historical data to catalogs, querying data for backtesting via `BacktestDataConfig`, and any operation that reads or writes `.parquet` files. It is also required by the `BacktestNode`, which loads data from the catalog via its configuration.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows | All platforms supported |
| Disk | Varies by dataset size | SSD recommended for large catalogs; Parquet is columnar and compact |
| Memory | Sufficient for dataset loading | Use `max_rows_per_group` to control memory during writes |
Dependencies
Python Packages
- `pyarrow` >= 22.0.0
- `fsspec` >= 2025.2.0, <= 2026.1.0
- `pandas` >= 2.3.3, < 3.0.0 (for DataFrame integration)
- `numpy` >= 1.26.4
Credentials
The following environment variables may be required:
- `NAUTILUS_PATH`: Base path for catalog storage. Required when using `ParquetDataCatalog.from_env()`. The catalog is created at `$NAUTILUS_PATH/catalog`.
- Cloud storage credentials (e.g., `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) when using fsspec with S3 or other cloud backends.
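For example, a local catalog base path plus S3 credentials could be configured as follows (all paths and key values are placeholders):

```shell
# Base path for ParquetDataCatalog.from_env(); the catalog lives at $NAUTILUS_PATH/catalog
export NAUTILUS_PATH=/path/to/data

# Only needed when the catalog URI points at S3 via fsspec/s3fs
export AWS_ACCESS_KEY_ID=...      # placeholder
export AWS_SECRET_ACCESS_KEY=...  # placeholder
```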
Quick Install
# Core dependencies (included with nautilus_trader)
pip install "pyarrow>=22.0.0" "fsspec>=2025.2.0" "pandas>=2.3.3"
# For S3 catalog access
pip install s3fs
# For GCS catalog access
pip install gcsfs
Code Evidence
NAUTILUS_PATH environment variable check from `persistence/catalog/parquet.py:182-184`:
if _NAUTILUS_PATH not in os.environ:
    raise OSError(f"'{_NAUTILUS_PATH}' environment variable is not set.")
return cls.from_uri(os.environ[_NAUTILUS_PATH] + "/catalog")
max_rows_per_group default from `persistence/catalog/parquet.py:111-114`:
max_rows_per_group : int, default 5000
The maximum number of rows per group. If the value is greater than 0,
then the dataset writer may split up large incoming batches into
multiple row groups.
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `OSError: 'NAUTILUS_PATH' environment variable is not set` | Using `from_env()` without setting the variable | Set `export NAUTILUS_PATH=/path/to/data` or use `from_uri()` instead |
| `ImportError: No module named 'pyarrow'` | PyArrow not installed | `pip install pyarrow>=22.0.0` |
| `FileNotFoundError` on catalog path | Catalog directory does not exist | Ensure the path exists or let `ParquetDataCatalog` create it via `from_uri()` |
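The first error above can be avoided by checking the environment up front and falling back to an explicit URI. A sketch of that pattern (the function name and fallback path are illustrative, not part of the NautilusTrader API):

```python
import os

def resolve_catalog_uri(fallback: str = "/path/to/data/catalog") -> str:
    """Mirror from_env(): use $NAUTILUS_PATH/catalog when set, else a fallback URI."""
    base = os.environ.get("NAUTILUS_PATH")
    return f"{base}/catalog" if base is not None else fallback

os.environ["NAUTILUS_PATH"] = "/data"  # for demonstration only
print(resolve_catalog_uri())  # /data/catalog
```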
Compatibility Notes
- fsspec backends: The catalog supports any fsspec-compatible filesystem (local, S3, GCS, Azure Blob). Install the appropriate fsspec implementation package (e.g., `s3fs`, `gcsfs`).
- DataFusion: Used internally for SQL queries on Parquet data. The `optimize_file_loading` parameter controls whether entire directories are registered (more efficient for many files) or individual files (needed for consolidation operations).