
Principle:Nautechsystems Nautilus trader Catalog Data Querying

From Leeroopedia


Field Value
Sources GitHub Repository, NautilusTrader Documentation
Domains Data Retrieval, Time-Series Queries, Columnar Storage, Apache Arrow
Last Updated 2026-02-10 12:00 GMT

Overview

Catalog data querying is the process of efficiently retrieving typed trading data from a columnar storage backend by filtering on data type, instrument identity, and time range.

Description

Once market data has been persisted to a Parquet-backed catalog, it must be retrievable in a way that is both fast and type-safe. Catalog data querying provides a structured interface for selecting subsets of stored data based on three primary dimensions:

  • Data type -- The class of data to retrieve (e.g., QuoteTick, TradeTick, Bar, Instrument, OrderBookDelta). Each type is stored in its own directory with a type-specific Arrow schema.
  • Instrument identity -- An optional filter to retrieve data for specific instruments, bar types, or other identifiers. This maps to subdirectory selection within the data type directory.
  • Time range -- Optional start and end timestamps that define the time window of interest. This enables file-level pruning (based on filename-encoded timestamps) and row-level filtering (using Arrow predicate pushdown on the ts_init column).

The querying system also provides convenience methods that pre-select the data type, reducing boilerplate for common access patterns like "give me all trade ticks for BTCUSDT between these dates."
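The three filter dimensions can be sketched as a plain-Python predicate over in-memory records. The record shape and field names here are illustrative stand-ins, not the actual catalog API:

```python
from dataclasses import dataclass

@dataclass
class Record:
    data_type: str       # e.g. "TradeTick"
    instrument_id: str   # e.g. "BTCUSDT-PERP.BINANCE"
    ts_init: int         # event timestamp in nanoseconds

def select(records, data_type, instrument_ids=None, start=None, end=None):
    """Filter records on type, optional identity, and optional time range."""
    out = []
    for r in records:
        if r.data_type != data_type:
            continue  # data-type dimension
        if instrument_ids is not None and r.instrument_id not in instrument_ids:
            continue  # instrument-identity dimension
        if start is not None and r.ts_init < start:
            continue  # time-range dimension (lower bound)
        if end is not None and r.ts_init > end:
            continue  # time-range dimension (upper bound)
        out.append(r)
    return out

records = [
    Record("TradeTick", "BTCUSDT-PERP.BINANCE", 100),
    Record("TradeTick", "ETHUSDT-PERP.BINANCE", 150),
    Record("QuoteTick", "BTCUSDT-PERP.BINANCE", 200),
]
hits = select(records, "TradeTick", {"BTCUSDT-PERP.BINANCE"}, start=50, end=120)
```

Omitting `instrument_ids`, `start`, or `end` leaves that dimension unconstrained, mirroring how the catalog treats them as optional filters.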

Usage

Catalog data querying is used in the following scenarios:

  • Backtesting data loading -- The backtest engine queries the catalog for specific data types and time ranges to feed into the simulation.
  • Research and analysis -- Researchers query subsets of historical data for strategy development, feature engineering, or statistical analysis.
  • Data validation -- After writing data, query it back to verify correctness of timestamps, counts, and content.
  • Instrument discovery -- Query all available instruments in the catalog to understand what data is available.

Theoretical Basis

Query Resolution Pipeline

A catalog query passes through several stages before returning data:

Stage 1: Directory Discovery
    - Compute the directory path for the requested data type
    - If identifiers are provided, filter to matching subdirectories
    - Otherwise, include all subdirectories for the data type

Stage 2: File-Level Filtering
    - Parse filename timestamps ({start_ns}-{end_ns}.parquet)
    - Exclude files whose time range does not overlap the query range
    - This is a coarse filter that avoids reading irrelevant files entirely

Stage 3: Backend Selection
    - For built-in Nautilus types (OrderBookDelta, QuoteTick, TradeTick, Bar, etc.):
      Use the Rust backend for maximum performance
    - For Instrument subtypes, custom data, or non-local filesystems:
      Use the PyArrow backend

Stage 4: Row-Level Filtering
    - Apply timestamp predicates to filter rows within selected files
    - Apply any additional WHERE clause filters (Rust backend)
    - Deserialize matching rows into typed NautilusTrader objects

Stage 5: Post-Processing
    - For OrderBookDeltas: batch individual deltas into grouped objects
    - For non-Nautilus types: wrap objects in CustomData containers
    - Return the final list of typed objects
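The five stages above can be sketched end to end over a toy in-memory catalog. The directory layout, filename-encoded ranges, and backend names here are simplified stand-ins for the real implementation:

```python
BUILTIN_TYPES = {"QuoteTick", "TradeTick", "Bar", "OrderBookDelta"}

def resolve(catalog, data_type, identifiers, q_start, q_end, local_fs=True):
    # Stage 1: directory discovery for the requested data type
    dirs = catalog.get(data_type, {})
    if identifiers:
        dirs = {k: v for k, v in dirs.items() if k in identifiers}
    # Stage 2: file-level pruning on filename-encoded [start_ns, end_ns]
    files = [
        f for subdir in dirs.values() for f in subdir
        if f["end_ns"] >= q_start and f["start_ns"] <= q_end
    ]
    # Stage 3: backend selection (Rust for built-ins on a local filesystem)
    backend = "rust" if data_type in BUILTIN_TYPES and local_fs else "pyarrow"
    # Stage 4: row-level filtering on ts_init within the surviving files
    rows = [
        r for f in files for r in f["rows"]
        if q_start <= r["ts_init"] <= q_end
    ]
    # Stage 5: post-processing (batching/wrapping) would happen here
    return backend, rows

catalog = {
    "TradeTick": {
        "BTCUSDT": [
            {"start_ns": 0, "end_ns": 100,
             "rows": [{"ts_init": 50}, {"ts_init": 90}]},
            {"start_ns": 200, "end_ns": 300,
             "rows": [{"ts_init": 250}]},
        ],
    },
}
backend, rows = resolve(catalog, "TradeTick", {"BTCUSDT"}, 60, 150)
```

Note how the second file is excluded in stage 2 before any of its rows are examined; that is the coarse-before-fine structure the pipeline is built around.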

File-Level Time Pruning

Given a query range [q_start, q_end] and a file covering [f_start, f_end]:

include_file = (f_end >= q_start) AND (f_start <= q_end)

This simple overlap test eliminates files that are entirely outside the query window, dramatically reducing I/O for narrow time-range queries over large catalogs.
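The pruning rule, combined with the `{start_ns}-{end_ns}.parquet` filename convention described above, can be expressed directly (assuming the nanosecond-integer filename format; other catalog versions may encode timestamps differently):

```python
def file_overlaps(filename: str, q_start: int, q_end: int) -> bool:
    """Parse filename-encoded timestamps and apply the overlap test."""
    stem = filename.rsplit(".", 1)[0]              # drop ".parquet"
    f_start, f_end = (int(p) for p in stem.split("-"))
    # include_file = (f_end >= q_start) AND (f_start <= q_end)
    return f_end >= q_start and f_start <= q_end
```

Because both the file range and the query range are closed intervals, files that merely touch the query boundary are still included, which is the safe choice: row-level filtering downstream removes any rows that fall outside the window.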

Dual-Backend Architecture

The query system maintains two execution paths:

                        +------------------+
                        |   query(...)     |
                        +--------+---------+
                                 |
                    +------------+-------------+
                    |                          |
            Built-in types              Other types
            (local "file" fs)           (or cloud fs)
                    |                          |
            +-------v--------+       +---------v--------+
            | _query_rust()  |       | _query_pyarrow() |
            +-------+--------+       +---------+--------+
                    |                          |
            DataBackendSession         pyarrow.dataset
            (Rust, zero-copy)          (Python, flexible)
                    |                          |
                    +------------+-------------+
                                 |
                        list[Data | CustomData]

The Rust backend provides superior performance for the most common data types (ticks, bars, deltas) through zero-copy deserialization. The PyArrow backend provides broader compatibility for Instrument subtypes, custom data, cloud filesystems, and explicit file lists.
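The dispatch decision can be written as a small predicate. The function and return values here are illustrative; the real implementation makes this choice internally:

```python
BUILTIN = {"OrderBookDelta", "OrderBookDepth10", "QuoteTick", "TradeTick", "Bar"}

def choose_backend(data_type, fs_protocol, files=None):
    """Pick the execution path: Rust only for built-in types on a local
    filesystem with no explicit file list; PyArrow for everything else."""
    if data_type in BUILTIN and fs_protocol == "file" and not files:
        return "rust"
    return "pyarrow"
```

The fallback direction matters: anything the fast path cannot handle (a cloud filesystem, an `Instrument` subtype, an explicit file list) silently routes to the flexible path rather than failing, so callers never need to know which backend served their query.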

Convenience Methods

The base catalog class provides convenience methods that pre-select the data type:

instruments(instrument_type, instrument_ids)   ->  query(Instrument subclasses, ...)
trade_ticks(instrument_ids)                    ->  query(TradeTick, ...)
quote_ticks(instrument_ids)                    ->  query(QuoteTick, ...)
bars(bar_types, instrument_ids)                ->  query(Bar, ...)
order_book_deltas(instrument_ids, batched)     ->  query(OrderBookDelta/Deltas, ...)
order_book_depth10(instrument_ids)             ->  query(OrderBookDepth10, ...)
instrument_status(instrument_ids)              ->  query(InstrumentStatus, ...)
funding_rates(instrument_ids)                  ->  query(FundingRateUpdate, ...)

Each convenience method delegates to the generic query method with the appropriate data class parameter.
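Delegation can be sketched with a thin wrapper class. The generic `query` signature and the string data-class tags below are illustrative, with the pipeline itself stubbed out:

```python
class Catalog:
    def query(self, data_cls, identifiers=None, start=None, end=None):
        # Generic entry point (stubbed): would run the resolution pipeline
        return [(data_cls, identifiers, start, end)]

    # Convenience methods pre-select the data class and forward the rest
    def trade_ticks(self, instrument_ids=None, start=None, end=None):
        return self.query("TradeTick", instrument_ids, start, end)

    def quote_ticks(self, instrument_ids=None, start=None, end=None):
        return self.query("QuoteTick", instrument_ids, start, end)
```

Keeping all filtering logic in the single generic method means the convenience layer stays trivially thin: each wrapper is one line, and new data types gain a typed accessor without duplicating any query logic.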

Subclass Querying

For polymorphic types like Instrument (which has subclasses CurrencyPair, CryptoPerpetual, Equity, FuturesContract, etc.), the query system iterates over all known subclasses and merges results:

def query_subclasses(base_cls, identifiers):
    """Query every known subclass of base_cls and merge the results."""
    results = []
    for subclass in base_cls.__subclasses__():
        try:
            results.extend(query(subclass, identifiers))
        except NotFound:  # no data stored for this subclass; skip it
            continue
    return results

This ensures that a call to catalog.instruments() returns all instrument types without requiring the caller to know which specific subclass was used during writing.
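As a self-contained illustration with stand-in classes (real code would dispatch each per-subclass lookup to the catalog's typed query path):

```python
class Instrument: ...
class CurrencyPair(Instrument): ...
class Equity(Instrument): ...

# Simulated storage: only CurrencyPair data was ever written
STORE = {CurrencyPair: ["BTC/USDT"]}

def query(cls, identifiers=None):
    """Per-subclass lookup; raises when no data exists for the class."""
    if cls not in STORE:
        raise LookupError(cls.__name__)
    return STORE[cls]

def query_subclasses(base_cls, identifiers=None):
    """Merge results across all subclasses, ignoring empty ones."""
    results = []
    for sub in base_cls.__subclasses__():
        try:
            results.extend(query(sub, identifiers))
        except LookupError:
            continue
    return results
```

A caller asking for `Instrument` gets the `CurrencyPair` data back even though it never names that subclass, while the absent `Equity` data is skipped silently rather than raising.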
