Overview
A concrete tool, provided by NautilusTrader, for initializing a Parquet-backed data catalog.
Description
The ParquetDataCatalog class provides a queryable data catalog that persists trading data to files in Apache Parquet (Arrow) format. Initialization configures the root storage path, the filesystem protocol (local, S3, GCS, Azure, or in-memory), the Arrow serializer, and the Parquet row group size. The class extends BaseDataCatalog (a singleton abstract base class) and uses fsspec for filesystem abstraction, allowing the same API to work transparently across different storage backends. The catalog also supports a Rust-backed query engine for high-performance data retrieval of built-in NautilusTrader data types.
Usage
Import and instantiate ParquetDataCatalog when you need to:
- Create a new data catalog directory for persisting market data.
- Open an existing catalog for querying or appending data.
- Configure cloud storage backends (S3, GCS) for distributed data access.
- Set up an in-memory catalog for unit testing.
Code Reference
Source Location
| Item | Value |
|---|---|
| File | nautilus_trader/persistence/catalog/parquet.py |
| Lines | L92-164 |
| Class | ParquetDataCatalog |
| Parent Class | BaseDataCatalog (from nautilus_trader/persistence/catalog/base.py) |
Signature
class ParquetDataCatalog(BaseDataCatalog):
    def __init__(
        self,
        path: PathLike[str] | str,
        fs_protocol: str | None = "file",
        fs_storage_options: dict | None = None,
        fs_rust_storage_options: dict | None = None,
        max_rows_per_group: int = 5_000,
        show_query_paths: bool = False,
    ) -> None: ...

    @classmethod
    def from_env(cls) -> ParquetDataCatalog: ...

    @classmethod
    def from_uri(
        cls,
        uri: str,
        fs_storage_options: dict[str, str] | None = None,
        fs_rust_storage_options: dict[str, str] | None = None,
    ) -> ParquetDataCatalog: ...
Import
from nautilus_trader.persistence.catalog import ParquetDataCatalog
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | PathLike[str] ¦ str | (required) | Root path for the catalog. Must be an absolute path for local filesystem. |
| fs_protocol | str ¦ None | "file" | Filesystem protocol for fsspec: "file" (local), "s3" (AWS S3), "gcs" (Google Cloud), "memory" (in-memory). |
| fs_storage_options | dict ¦ None | None | Provider-specific storage options (credentials, endpoint URLs, etc.). |
| fs_rust_storage_options | dict ¦ None | None | Storage options specifically for the Rust backend. Defaults to fs_storage_options if not specified. |
| max_rows_per_group | int | 5000 | Maximum number of rows per Parquet row group. Controls write batching and query granularity. |
| show_query_paths | bool | False | If True, print globbed query file paths to stdout for debugging. |
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | ParquetDataCatalog | Initialized catalog instance with configured filesystem, serializer, and path. |
Key Instance Attributes
| Attribute | Type | Description |
|---|---|---|
| path | str | Normalized absolute path to the catalog root directory. |
| fs_protocol | str | The resolved filesystem protocol string. |
| fs | fsspec.AbstractFileSystem | The initialized fsspec filesystem instance. |
| serializer | ArrowSerializer | Serializer for converting NautilusTrader objects to/from Arrow tables. |
| max_rows_per_group | int | Configured Parquet row group size limit. |
Usage Examples
Basic Local Catalog
from nautilus_trader.persistence.catalog import ParquetDataCatalog
# Initialize a local catalog
catalog = ParquetDataCatalog(path="/data/nautilus_catalog")
print(catalog.path) # /data/nautilus_catalog
print(catalog.fs_protocol) # file
Catalog from URI
from nautilus_trader.persistence.catalog import ParquetDataCatalog
# Create from a local URI
catalog = ParquetDataCatalog.from_uri("file:///data/nautilus_catalog")
# Create from an S3 URI with credentials
catalog = ParquetDataCatalog.from_uri(
    uri="s3://my-bucket/nautilus_catalog",
    fs_storage_options={
        "key": "AWS_ACCESS_KEY_ID",
        "secret": "AWS_SECRET_ACCESS_KEY",
        "endpoint_url": "https://s3.amazonaws.com",
    },
)
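To make the relationship between a URI and the `fs_protocol` and `path` constructor arguments more concrete, the following is a minimal sketch of how a catalog URI could be split into those two parts. It uses only the standard library as a stand-in and is not NautilusTrader's implementation; the helper name `split_catalog_uri` is hypothetical, and the rule that the bucket name becomes the first path segment is an assumption.

```python
from urllib.parse import urlsplit


def split_catalog_uri(uri: str) -> tuple[str, str]:
    """Hypothetical helper: split a catalog URI into (protocol, path)."""
    parts = urlsplit(uri)
    # Assumption: a bare path without a scheme means the local filesystem
    protocol = parts.scheme or "file"
    # For object stores the bucket sits in the netloc; for file URIs it is empty
    path = parts.netloc + parts.path
    return protocol, path


print(split_catalog_uri("file:///data/nautilus_catalog"))    # ('file', '/data/nautilus_catalog')
print(split_catalog_uri("s3://my-bucket/nautilus_catalog"))  # ('s3', 'my-bucket/nautilus_catalog')
```

The resulting pair corresponds to what would otherwise be passed explicitly as `fs_protocol` and `path`.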
Catalog from Environment Variable
import os
from nautilus_trader.persistence.catalog import ParquetDataCatalog
# Set NAUTILUS_PATH environment variable
os.environ["NAUTILUS_PATH"] = "/home/user/.nautilus"
# Catalog will be created at /home/user/.nautilus/catalog
catalog = ParquetDataCatalog.from_env()
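The path resolution described in the comment above can be sketched in plain Python. This is only an illustration, assuming `from_env` appends `/catalog` to the value of `NAUTILUS_PATH`; the helper name `resolve_catalog_path` and the error raised for a missing variable are hypothetical and not part of NautilusTrader.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def resolve_catalog_path(env: Mapping[str, str] | None = None) -> str:
    """Hypothetical helper: derive the catalog path from NAUTILUS_PATH."""
    env = os.environ if env is None else env
    if "NAUTILUS_PATH" not in env:
        # Assumption: a missing variable is treated as an error
        raise OSError("'NAUTILUS_PATH' environment variable is not set")
    # Assumption: the catalog lives in a 'catalog' subdirectory of NAUTILUS_PATH
    return env["NAUTILUS_PATH"].rstrip("/") + "/catalog"


print(resolve_catalog_path({"NAUTILUS_PATH": "/home/user/.nautilus"}))
# /home/user/.nautilus/catalog
```

Passing the environment as an argument keeps the sketch easy to test without mutating `os.environ`.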
In-Memory Catalog for Testing
from nautilus_trader.persistence.catalog import ParquetDataCatalog
# Use an in-memory filesystem for unit tests
catalog = ParquetDataCatalog(
    path="/test_catalog",
    fs_protocol="memory",
)
Catalog with Custom Row Group Size
from nautilus_trader.persistence.catalog import ParquetDataCatalog
# Use larger row groups for bulk historical data
catalog = ParquetDataCatalog(
    path="/data/bulk_catalog",
    max_rows_per_group=50_000,
)
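As a rough guide to what a given `max_rows_per_group` implies, the number of row groups for a single write can be estimated with ceiling division. This is a back-of-the-envelope sketch, assuming the writer fills each row group up to the limit; the helper name `estimate_row_groups` is hypothetical, and actual file layout may differ.

```python
def estimate_row_groups(n_rows: int, max_rows_per_group: int = 5_000) -> int:
    """Hypothetical helper: estimate row groups needed for n_rows."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if max_rows_per_group <= 0:
        raise ValueError("max_rows_per_group must be positive")
    # Integer ceiling division: a partially filled last group still counts
    return (n_rows + max_rows_per_group - 1) // max_rows_per_group


for limit in (5_000, 50_000):
    print(limit, estimate_row_groups(1_000_000, limit))
# 5000 200
# 50000 20
```

Under that assumption, larger row groups mean fewer groups per file, at the cost of coarser query granularity.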
Related Pages