
Principle:Nautechsystems Nautilus trader Parquet Catalog Initialization

From Leeroopedia


Field         Value
Sources       GitHub Repository, NautilusTrader Documentation
Domains       Data Storage, Columnar Formats, Time-Series Data, Apache Arrow
Last Updated  2026-02-10 12:00 GMT

Overview

Parquet catalog initialization is the process of establishing a structured, queryable data store backed by Apache Parquet columnar files for efficient persistence and retrieval of time-series market data.

Description

Time-series market data -- tick data, OHLCV bars, order book snapshots, instrument definitions -- accumulates at high volume and must be stored in a format that supports both efficient sequential writes and fast analytical queries over arbitrary time ranges. Apache Parquet, a columnar storage format built on Apache Arrow, is well suited for this purpose because it provides:

  • Columnar compression -- Data is stored column-by-column, allowing type-specific compression algorithms (dictionary encoding, delta encoding, run-length encoding) that dramatically reduce file sizes for repetitive fields like instrument IDs and timestamps.
  • Predicate pushdown -- Query engines can skip entire row groups based on column statistics (min/max values), enabling fast time-range filtering without reading the full dataset; the sketch after this list demonstrates the effect.
  • Schema evolution -- Parquet files carry embedded schemas, allowing the catalog to evolve its data model over time without breaking backward compatibility.
  • Cross-language interoperability -- Arrow-based formats are readable from Python, Rust, C++, Java, and other languages, enabling both Python-based analysis and Rust-based high-performance query paths.
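
Predicate pushdown in particular is easy to observe with pyarrow, the Arrow reference implementation these formats build on. The sketch below writes a tiny quote-tick table with deliberately small row groups, then reads back only the rows matching a time filter; the field names are illustrative, not the catalog's actual schema.

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Tiny quote-tick table; field names are illustrative only.
table = pa.table({
    "instrument_id": ["BTCUSDT.BINANCE"] * 4,
    "bid_price": [42000.0, 42001.5, 42002.0, 41999.0],
    "ts_event": [1704067200000000000, 1704067201000000000,
                 1704067202000000000, 1704067203000000000],
})

# Small row groups give each group narrow min/max ts_event statistics.
pq.write_table(table, "quotes.parquet", row_group_size=2)

# The reader consults those statistics and skips row groups that
# cannot match the filter -- predicate pushdown.
dataset = ds.dataset("quotes.parquet", format="parquet")
subset = dataset.to_table(filter=ds.field("ts_event") >= 1704067202000000000)
print(subset.num_rows)  # 2

Because each two-row group carries its own min/max ts_event statistics, the reader decodes only the final group.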

Catalog initialization establishes the root directory structure, configures the filesystem backend (local, S3, GCS, Azure, or in-memory), and prepares the serialization pipeline. The result is a persistent, organized data store where each data type and instrument combination maps to a dedicated subdirectory of Parquet files.
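
A minimal construction sketch, assuming the ParquetDataCatalog class documented for recent NautilusTrader releases (its parameters mirror the Catalog(path, fs_protocol, fs_storage_options) signature shown under Theoretical Basis below; verify the import path and option keys against your installed version):

from nautilus_trader.persistence.catalog import ParquetDataCatalog

# Local-disk catalog rooted at /data/catalog
catalog = ParquetDataCatalog("/data/catalog")

# Same API against S3; credentials pass through as fsspec storage options
s3_catalog = ParquetDataCatalog(
    path="my-bucket/catalog",
    fs_protocol="s3",
    fs_storage_options={"key": "<access-key>", "secret": "<secret-key>"},
)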

Usage

Parquet catalog initialization is required before any of the following operations (the first two are sketched after the list):

  • Writing data -- Persisting loaded market data (from Databento, Tardis, or other sources) to disk.
  • Querying data -- Retrieving time-filtered subsets of historical data for backtesting.
  • Backtest configuration -- Providing a catalog reference to the backtesting engine so it can stream data during simulation.
  • Research and analysis -- Loading data into pandas DataFrames or directly into NautilusTrader objects for exploratory analysis.
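
A combined sketch of the first two operations, assuming the catalog instance from above and the write_data/quote_ticks methods documented for recent NautilusTrader releases (check names and signatures against your installed version):

from nautilus_trader.persistence.catalog import ParquetDataCatalog

def persist_and_query(catalog: ParquetDataCatalog, ticks: list) -> list:
    # Persist loaded objects; the catalog routes them into
    # {catalog_root}/QuoteTick/{instrument_id}/ Parquet files.
    catalog.write_data(ticks)

    # Retrieve a time-filtered subset for backtesting or research.
    return catalog.quote_ticks(
        instrument_ids=["BTCUSDT.BINANCE"],
        start="2024-01-01",
        end="2024-01-02",
    )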

Theoretical Basis

Directory Layout

A Parquet data catalog organizes files in a hierarchical directory structure:

{catalog_root}/
    {DataTypeName}/
        {instrument_id_or_bar_type}/
            {start_timestamp}-{end_timestamp}.parquet
            {start_timestamp}-{end_timestamp}.parquet
            ...

For example:

/data/catalog/
    QuoteTick/
        BTCUSDT.BINANCE/
            1704067200000000000-1704153600000000000.parquet
        ETHUSDT.BINANCE/
            1704067200000000000-1704153600000000000.parquet
    Bar/
        BTCUSDT.BINANCE-1-MINUTE-LAST-EXTERNAL/
            1704067200000000000-1704153600000000000.parquet
    CurrencyPair/
        BTCUSDT.BINANCE/
            0-0.parquet

Each Parquet file covers a specific, non-overlapping time interval. Filenames encode the start and end nanosecond timestamps, enabling fast file-level filtering before any data is read.
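
Because the covered interval is encoded in the filename, a query can discard whole files before touching any Parquet metadata. A minimal sketch of that file-level filter (a hypothetical helper, not the catalog's internal code):

import os

def files_intersecting(directory, start_ns, end_ns):
    # Keep files whose [start, end] filename interval overlaps the
    # query range; no file contents are read at this stage.
    matches = []
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)
        if ext != ".parquet":
            continue
        file_start, file_end = (int(part) for part in stem.split("-"))
        if file_start <= end_ns and file_end >= start_ns:
            matches.append(os.path.join(directory, name))
    return matches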

Filesystem Abstraction

The catalog uses fsspec (filesystem spec) to abstract storage backends:

Catalog(path, fs_protocol, fs_storage_options)
    |
    v
fsspec.filesystem(protocol, **storage_options)
    |
    +-- "file"   -> local filesystem
    +-- "s3"     -> Amazon S3 (requires aws credentials)
    +-- "gcs"    -> Google Cloud Storage
    +-- "memory" -> in-memory filesystem (for testing)
    +-- ...      -> any fsspec-compatible backend

This abstraction allows the same catalog API to operate transparently against local disk, cloud object storage, or in-memory filesystems.
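
A short sketch of the abstraction in action: fsspec's registered "file" and "memory" backends expose the same interface, so code written against one works unchanged against the other.

import fsspec

# Each protocol string resolves to a registered fsspec implementation.
local_fs = fsspec.filesystem("file")
memory_fs = fsspec.filesystem("memory")  # convenient for unit tests

# Identical calls regardless of backend
memory_fs.makedirs("/catalog/QuoteTick/BTCUSDT.BINANCE", exist_ok=True)
print(memory_fs.ls("/catalog/QuoteTick"))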

Initialization Pseudocode

function initialize_catalog(path, fs_protocol="file", storage_options={}):
    filesystem = create_filesystem(fs_protocol, storage_options)
    path = normalize_path(path, fs_protocol)

    serializer = ArrowSerializer()       # Handles object <-> Arrow conversion
    max_rows_per_group = 5000            # Parquet row group size

    return Catalog(path, filesystem, serializer, max_rows_per_group)

Row Group Sizing

Parquet files are internally organized into row groups. The row group size affects:

  • Memory usage -- Larger row groups require more memory during writes but allow better compression.
  • Query granularity -- Smaller row groups enable finer-grained predicate pushdown but increase metadata overhead.
  • Write throughput -- The writer may split large incoming batches into multiple row groups to stay within the configured maximum.

A default of 5,000 rows per group provides a reasonable balance for typical tick-level market data.
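
The effect of the cap is easy to observe with pyarrow: writing 12,000 rows with a 5,000-row limit produces three row groups (5,000 + 5,000 + 2,000). A minimal sketch:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ts_event": list(range(12_000)),
    "price": [100.0] * 12_000,
})

# The writer splits the batch: ceil(12_000 / 5_000) = 3 row groups.
pq.write_table(table, "ticks.parquet", row_group_size=5_000)

meta = pq.ParquetFile("ticks.parquet").metadata
print(meta.num_row_groups)         # 3
print(meta.row_group(0).num_rows)  # 5000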

URI-Based Construction

Catalogs can also be constructed from a URI string, which encodes both the protocol and path:

"file:///data/catalog"          -> local path /data/catalog
"s3://my-bucket/catalog"        -> S3 bucket
"gcs://my-bucket/catalog"       -> Google Cloud Storage
"/data/catalog"                 -> assumed local (resolved to absolute URI)
