Principle: NautilusTrader (nautechsystems) Parquet Catalog Initialization
| Field | Value |
|---|---|
| Sources | GitHub Repository, NautilusTrader Documentation |
| Domains | Data Storage, Columnar Formats, Time-Series Data, Apache Arrow |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Parquet catalog initialization is the process of establishing a structured, queryable data store backed by Apache Parquet columnar files for efficient persistence and retrieval of time-series market data.
Description
Time-series market data -- tick data, OHLCV bars, order book snapshots, instrument definitions -- accumulates at high volume and must be stored in a format that supports both efficient sequential writes and fast analytical queries over arbitrary time ranges. Apache Parquet, a columnar storage format paired here with Apache Arrow for in-memory representation, is well suited for this purpose because it provides:
- Columnar compression -- Data is stored column-by-column, allowing type-specific compression algorithms (dictionary encoding, delta encoding, run-length encoding) that dramatically reduce file sizes for repetitive fields like instrument IDs and timestamps.
- Predicate pushdown -- Query engines can skip entire row groups based on column statistics (min/max values), enabling fast time-range filtering without reading the full dataset.
- Schema evolution -- Parquet files carry embedded schemas, allowing the catalog to evolve its data model over time without breaking backward compatibility.
- Cross-language interoperability -- Arrow-based formats are readable from Python, Rust, C++, Java, and other languages, enabling both Python-based analysis and Rust-based high-performance query paths.
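To make the first point concrete, here is a toy run-length encoder -- a simplified stand-in, not Parquet's actual implementation -- showing why a repetitive column such as instrument IDs compresses so well when stored column-by-column:

```python
def rle_encode(column):
    """Collapse consecutive repeated values into (value, count) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(v, n) for v, n in runs]

# An instrument-ID column is highly repetitive in row order:
column = ["BTCUSDT.BINANCE"] * 4 + ["ETHUSDT.BINANCE"] * 3
print(rle_encode(column))
# [('BTCUSDT.BINANCE', 4), ('ETHUSDT.BINANCE', 3)] -- 7 rows, 2 runs
```

Row-oriented storage interleaves such values with other fields, destroying the runs; columnar layout keeps them adjacent so this kind of encoding applies directly.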
Catalog initialization establishes the root directory structure, configures the filesystem backend (local, S3, GCS, Azure, or in-memory), and prepares the serialization pipeline. The result is a persistent, organized data store where each data type and instrument combination maps to a dedicated subdirectory of Parquet files.
Usage
Parquet catalog initialization is required before any of the following operations:
- Writing data -- Persisting loaded market data (from Databento, Tardis, or other sources) to disk.
- Querying data -- Retrieving time-filtered subsets of historical data for backtesting.
- Backtest configuration -- Providing a catalog reference to the backtesting engine so it can stream data during simulation.
- Research and analysis -- Loading data into pandas DataFrames or directly into NautilusTrader objects for exploratory analysis.
Theoretical Basis
Directory Layout
A Parquet data catalog organizes files in a hierarchical directory structure:
{catalog_root}/
{DataTypeName}/
{instrument_id_or_bar_type}/
{start_timestamp}-{end_timestamp}.parquet
{start_timestamp}-{end_timestamp}.parquet
...
For example:
/data/catalog/
QuoteTick/
BTCUSDT.BINANCE/
1704067200000000000-1704153600000000000.parquet
ETHUSDT.BINANCE/
1704067200000000000-1704153600000000000.parquet
Bar/
BTCUSDT.BINANCE-1-MINUTE-LAST-EXTERNAL/
1704067200000000000-1704153600000000000.parquet
CurrencyPair/
BTCUSDT.BINANCE/
0-0.parquet
Each Parquet file covers a specific, non-overlapping time interval. Filenames encode the start and end nanosecond timestamps, enabling fast file-level filtering before any data is read.
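The file-level filtering step can be sketched in a few lines: parse the {start}-{end} nanosecond timestamps out of each filename and keep only the files whose interval intersects the query range. This is an illustrative sketch, not the catalog's actual query code:

```python
def file_interval(filename):
    """Parse '{start}-{end}.parquet' into an (int, int) nanosecond interval."""
    stem = filename.removesuffix(".parquet")
    start, _, end = stem.partition("-")
    return int(start), int(end)

def files_overlapping(filenames, query_start, query_end):
    """Keep only files whose [start, end] interval intersects the query range."""
    selected = []
    for name in filenames:
        start, end = file_interval(name)
        if start <= query_end and end >= query_start:
            selected.append(name)
    return selected

files = [
    "1704067200000000000-1704153600000000000.parquet",
    "1704153600000000001-1704240000000000000.parquet",
]
print(files_overlapping(files, 1704100000000000000, 1704150000000000000))
# only the first file intersects the query interval
```

Because the intervals are non-overlapping, a time-range query touches only the files it needs; the contents of the excluded files are never read.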
Filesystem Abstraction
The catalog uses fsspec (filesystem spec) to abstract storage backends:
Catalog(path, fs_protocol, fs_storage_options)
|
v
fsspec.filesystem(protocol, **storage_options)
|
+-- "file" -> local filesystem
    +-- "s3"     -> Amazon S3 (requires AWS credentials)
+-- "gcs" -> Google Cloud Storage
+-- "memory" -> in-memory filesystem (for testing)
+-- ... -> any fsspec-compatible backend
This abstraction allows the same catalog API to operate transparently against local disk, cloud object storage, or in-memory filesystems.
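A minimal stand-in for the dispatch shown above -- not fsspec itself, just the shape of the pattern -- is a registry mapping protocol strings to backend classes behind one factory function:

```python
class _FS:
    """Common base: every backend accepts arbitrary storage options."""
    def __init__(self, **storage_options):
        self.storage_options = storage_options

class LocalFS(_FS):
    protocol = "file"

class MemoryFS(_FS):
    protocol = "memory"

_BACKENDS = {"file": LocalFS, "memory": MemoryFS}

def filesystem(protocol, **storage_options):
    """Mimic fsspec.filesystem(): resolve a protocol string to a backend."""
    try:
        return _BACKENDS[protocol](**storage_options)
    except KeyError:
        raise ValueError(f"unknown filesystem protocol: {protocol!r}")

fs = filesystem("memory")
print(fs.protocol)  # memory
```

The catalog code only ever sees the common interface, which is what lets one API serve local disk, object storage, and in-memory filesystems alike.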
Initialization Pseudocode
function initialize_catalog(path, fs_protocol="file", storage_options={}):
filesystem = create_filesystem(fs_protocol, storage_options)
path = normalize_path(path, fs_protocol)
serializer = ArrowSerializer() # Handles object <-> Arrow conversion
max_rows_per_group = 5000 # Parquet row group size
return Catalog(path, filesystem, serializer, max_rows_per_group)
Row Group Sizing
Parquet files are internally organized into row groups. The row group size affects:
- Memory usage -- Larger row groups require more memory during writes but allow better compression.
- Query granularity -- Smaller row groups enable finer-grained predicate pushdown but increase metadata overhead.
- Write throughput -- The writer may split large incoming batches into multiple row groups to stay within the configured maximum.
A default of 5,000 rows per group provides a reasonable balance for typical tick-level market data.
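The writer-side splitting described above reduces to chunking an incoming batch into slices no larger than the configured maximum. A minimal sketch:

```python
def split_into_row_groups(rows, max_rows_per_group=5000):
    """Yield consecutive slices of at most max_rows_per_group rows."""
    for i in range(0, len(rows), max_rows_per_group):
        yield rows[i : i + max_rows_per_group]

batch = list(range(12_000))  # e.g. 12,000 ticks arriving in one write call
groups = list(split_into_row_groups(batch))
print([len(g) for g in groups])  # [5000, 5000, 2000]
```

Only the final group may be short; every other group is exactly at the configured maximum, which keeps row-group statistics (and therefore predicate pushdown granularity) uniform across the file.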
URI-Based Construction
Catalogs can also be constructed from a URI string, which encodes both the protocol and path:
"file:///data/catalog" -> local path /data/catalog
"s3://my-bucket/catalog" -> S3 bucket
"gcs://my-bucket/catalog" -> Google Cloud Storage
"/data/catalog" -> assumed local (resolved to absolute URI)
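The URI resolution rules above can be sketched with the standard library alone -- an illustrative parser, not the catalog's actual implementation:

```python
from urllib.parse import urlparse

def parse_catalog_uri(uri):
    """Split a catalog URI into (protocol, path), defaulting to local files."""
    parsed = urlparse(uri)
    if not parsed.scheme:
        return "file", uri            # bare path: assume local filesystem
    if parsed.scheme == "file":
        return "file", parsed.path    # strip the file:// prefix
    # Object stores keep the bucket (netloc) as part of the path.
    return parsed.scheme, parsed.netloc + parsed.path

print(parse_catalog_uri("file:///data/catalog"))    # ('file', '/data/catalog')
print(parse_catalog_uri("s3://my-bucket/catalog"))  # ('s3', 'my-bucket/catalog')
print(parse_catalog_uri("/data/catalog"))           # ('file', '/data/catalog')
```

The (protocol, path) pair then feeds straight into the fsspec-style filesystem construction covered earlier.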