Principle: nautechsystems NautilusTrader Catalog Data Querying
| Field | Value |
|---|---|
| Sources | GitHub Repository, NautilusTrader Documentation |
| Domains | Data Retrieval, Time-Series Queries, Columnar Storage, Apache Arrow |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Catalog data querying is the process of efficiently retrieving typed trading data from a columnar storage backend by filtering on data type, instrument identity, and time range.
Description
Once market data has been persisted to a Parquet-backed catalog, it must be retrievable in a way that is both fast and type-safe. Catalog data querying provides a structured interface for selecting subsets of stored data based on three primary dimensions:
- Data type -- The class of data to retrieve (e.g., QuoteTick, TradeTick, Bar, Instrument, OrderBookDelta). Each type is stored in its own directory with a type-specific Arrow schema.
- Instrument identity -- An optional filter to retrieve data for specific instruments, bar types, or other identifiers. This maps to subdirectory selection within the data type directory.
- Time range -- Optional start and end timestamps that define the time window of interest. This enables file-level pruning (based on filename-encoded timestamps) and row-level filtering (using Arrow predicate pushdown on the ts_init column).
The querying system also provides convenience methods that pre-select the data type, reducing boilerplate for common access patterns like "give me all trade ticks for BTCUSDT between these dates."
Usage
Catalog data querying is used in the following scenarios:
- Backtesting data loading -- The backtest engine queries the catalog for specific data types and time ranges to feed into the simulation.
- Research and analysis -- Researchers query subsets of historical data for strategy development, feature engineering, or statistical analysis.
- Data validation -- After writing data, query it back to verify correctness of timestamps, counts, and content.
- Instrument discovery -- Query all available instruments in the catalog to understand what data is available.
Theoretical Basis
Query Resolution Pipeline
A catalog query passes through several stages before returning data:
Stage 1: Directory Discovery
- Compute the directory path for the requested data type
- If identifiers are provided, filter to matching subdirectories
- Otherwise, include all subdirectories for the data type
Stage 2: File-Level Filtering
- Parse filename timestamps ({start_ns}-{end_ns}.parquet)
- Exclude files whose time range does not overlap the query range
- This is a coarse filter that avoids reading irrelevant files entirely
Stage 3: Backend Selection
- For built-in Nautilus types (OrderBookDelta, QuoteTick, TradeTick, Bar, etc.):
Use the Rust backend for maximum performance
- For Instrument subtypes, custom data, or non-local filesystems:
Use the PyArrow backend
Stage 4: Row-Level Filtering
- Apply timestamp predicates to filter rows within selected files
- Apply any additional WHERE clause filters (Rust backend)
- Deserialize matching rows into typed NautilusTrader objects
Stage 5: Post-Processing
- For OrderBookDeltas: batch individual deltas into grouped objects
- For non-Nautilus types: wrap objects in CustomData containers
- Return the final list of typed objects
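The stages above can be sketched end-to-end over a toy in-memory catalog. The path layout, filename format, and row shape here are simplified assumptions, not the library's storage layer:

```python
# Toy catalog: each "file" key encodes data type, instrument, and time span;
# each value is a list of (ts_init, payload) rows.
FILES = {
    "quote_tick/BTCUSDT/100-200.parquet": [(120, "a"), (180, "b")],
    "quote_tick/BTCUSDT/300-400.parquet": [(350, "c")],
    "quote_tick/ETHUSDT/100-200.parquet": [(150, "d")],
}

def query(files, data_type, instrument_id, q_start, q_end):
    out = []
    for path, rows in files.items():
        type_dir, ident, fname = path.split("/")
        # Stage 1: directory discovery -- match data type and identifier.
        if type_dir != data_type or ident != instrument_id:
            continue
        # Stage 2: file-level pruning from the filename-encoded time span.
        f_start, f_end = (int(x) for x in fname.split(".")[0].split("-"))
        if f_end < q_start or f_start > q_end:
            continue
        # Stage 4: row-level timestamp filtering.
        out.extend(p for ts, p in rows if q_start <= ts <= q_end)
    return out
```

Stage 3 (backend selection) and Stage 5 (post-processing) are omitted here; in the real system they decide how files are read and how rows are materialized, not which rows match.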
File-Level Time Pruning
Given a query range [q_start, q_end] and a file covering [f_start, f_end]:
include_file = (f_end >= q_start) AND (f_start <= q_end)
This simple overlap test eliminates files that are entirely outside the query window, dramatically reducing I/O for narrow time-range queries over large catalogs.
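The overlap test combined with filename parsing can be sketched as follows (the filename pattern mirrors the `{start_ns}-{end_ns}.parquet` convention described above):

```python
import re

# Data files encode their covered time span as {start_ns}-{end_ns}.parquet.
_FILENAME = re.compile(r"(\d+)-(\d+)\.parquet$")

def file_overlaps(filename: str, q_start: int, q_end: int) -> bool:
    m = _FILENAME.search(filename)
    if m is None:
        return False  # not a data file; exclude it
    f_start, f_end = int(m.group(1)), int(m.group(2))
    # include_file = (f_end >= q_start) AND (f_start <= q_end)
    return f_end >= q_start and f_start <= q_end
```

Because the test needs only the filename, files outside the window are skipped without opening them at all.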
Dual-Backend Architecture
The query system maintains two execution paths:
+------------------+
| query(...) |
+--------+---------+
|
+------------+-------------+
| |
Built-in types Other types
(local "file" fs) (or cloud fs)
| |
+-------v--------+ +---------v--------+
| _query_rust() | | _query_pyarrow() |
+-------+--------+ +---------+--------+
| |
DataBackendSession pyarrow.dataset
(Rust, zero-copy) (Python, flexible)
| |
+------------+-------------+
|
list[Data | CustomData]
The Rust backend provides superior performance for the most common data types (ticks, bars, deltas) through zero-copy deserialization. The PyArrow backend provides broader compatibility for Instrument subtypes, custom data, cloud filesystems, and explicit file lists.
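The dispatch can be sketched as a simple predicate; the set of Rust-backed type names and the `"file"` protocol check here are assumptions mirroring the diagram above:

```python
# Types served by the Rust backend when reading from a local filesystem
# (assumed set, based on the built-in types named above).
RUST_BACKED_TYPES = {"OrderBookDelta", "QuoteTick", "TradeTick", "Bar"}

def select_backend(data_type: str, fs_protocol: str) -> str:
    """Choose the execution path for a query."""
    if data_type in RUST_BACKED_TYPES and fs_protocol == "file":
        return "rust"     # zero-copy DataBackendSession path
    return "pyarrow"      # flexible pyarrow.dataset path
```

Everything that falls through the first branch, such as Instrument subtypes or any query against a cloud filesystem, takes the PyArrow path.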
Convenience Methods
The base catalog class provides convenience methods that pre-select the data type:
instruments(instrument_type, instrument_ids) -> query(Instrument subclasses, ...)
trade_ticks(instrument_ids) -> query(TradeTick, ...)
quote_ticks(instrument_ids) -> query(QuoteTick, ...)
bars(bar_types, instrument_ids) -> query(Bar, ...)
order_book_deltas(instrument_ids, batched) -> query(OrderBookDelta/Deltas, ...)
order_book_depth10(instrument_ids) -> query(OrderBookDepth10, ...)
instrument_status(instrument_ids) -> query(InstrumentStatus, ...)
funding_rates(instrument_ids) -> query(FundingRateUpdate, ...)
Each convenience method delegates to the generic query method with the appropriate data class parameter.
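The delegation pattern can be sketched as follows; the generic query body is a stand-in that echoes its arguments rather than the real lookup:

```python
class CatalogSketch:
    def query(self, data_cls, identifiers=None, start=None, end=None):
        # Stand-in for the real generic query: return what would be looked up.
        return (data_cls, identifiers, start, end)

    def trade_ticks(self, instrument_ids=None, start=None, end=None):
        # Convenience method: pre-selects the data class, forwards the rest.
        return self.query("TradeTick", identifiers=instrument_ids,
                          start=start, end=end)
```

The caller writes `catalog.trade_ticks(["BTCUSDT"])` instead of spelling out the data class and keyword plumbing on every call.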
Subclass Querying
For polymorphic types like Instrument (which has subclasses CurrencyPair, CryptoPerpetual, Equity, FuturesContract, etc.), the query system iterates over all known subclasses and merges results:
def query_subclasses(base_cls, identifiers):
    results = []
    for subclass in base_cls.__subclasses__():
        try:
            results.extend(query(subclass, identifiers))
        except NotFound:  # no data was written for this subclass
            continue
    return results
This ensures that a call to catalog.instruments() returns all instrument types without requiring the caller to know which specific subclass was used during writing.
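Note that Python's `cls.__subclasses__()` returns only direct children, so deep hierarchies need recursion. A minimal sketch (the class nesting below is hypothetical, chosen only to exercise the recursion):

```python
def all_subclasses(cls):
    # Recursively collect every subclass, since cls.__subclasses__()
    # yields only direct children.
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found

class Instrument: ...
class CurrencyPair(Instrument): ...
class Equity(Instrument): ...
class SyntheticPair(CurrencyPair): ...  # hypothetical grandchild for illustration
```

With this helper, `all_subclasses(Instrument)` reaches `SyntheticPair` even though it is not a direct subclass of the base.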