Principle: NautilusTrader Catalog Data Writing
| Field | Value |
|---|---|
| Sources | GitHub Repository, NautilusTrader Documentation |
| Domains | Data Persistence, Columnar Storage, Apache Arrow, Time-Series Data |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Catalog data writing is the process of persisting typed trading data objects -- market data, instrument definitions, and events -- to a columnar storage backend in an organized, time-partitioned, and type-safe manner.
Description
After market data has been loaded from external sources and normalized into typed objects, it must be written to persistent storage for future retrieval. Catalog data writing handles the serialization of in-memory data objects to Apache Parquet files, organized by data type and instrument identity, with strict guarantees about temporal ordering and interval disjointness.
The key problems that catalog data writing solves include:
- Type-aware serialization -- Each data type (QuoteTick, TradeTick, Bar, Instrument, etc.) has a distinct Arrow schema. The writer must select the correct serializer and directory path for each type.
- Instrument partitioning -- Data for different instruments is written to separate subdirectories, enabling efficient per-instrument queries without scanning unrelated data.
- Temporal ordering enforcement -- All data within a single write operation must be monotonically non-decreasing by initialization timestamp (ts_init). This invariant ensures that downstream consumers can rely on sorted data.
- Disjoint interval validation -- Each Parquet file covers a specific time interval encoded in its filename. The writer validates that new files do not create overlapping intervals with existing files, preventing duplicate or ambiguous data.
- Automatic grouping -- A heterogeneous list of data objects is automatically sorted, grouped by type and identifier, and written as separate chunks to the appropriate locations.
Usage
Catalog data writing is used in the following scenarios:
- Data pipeline ingestion -- After loading data from Databento DBN files, Tardis CSV files, or other sources, write it to the catalog for persistent storage.
- Incremental updates -- Append new time-range chunks to an existing catalog without overwriting or duplicating data.
- Multi-instrument catalogs -- Write data for many instruments in a single call; the writer automatically routes each instrument's data to the correct subdirectory.
- Backtest data preparation -- Build a complete catalog of historical data that the backtesting engine can query efficiently.
Theoretical Basis
Write Pipeline
The data writing process follows a multi-stage pipeline:
Stage 1: Classification
- For each data object, determine:
  (a) the class name (e.g., "QuoteTick", "Bar", "CurrencyPair")
  (b) the identifier (instrument_id, bar_type, or None for custom data)
Stage 2: Sorting and Grouping
- Sort all objects by (class_name, identifier)
- Group into contiguous chunks of the same type and identifier
Stage 3: Validation
- For each chunk, verify monotonic non-decreasing ts_init ordering
- Check that the new time interval is disjoint from existing files
Stage 4: Serialization
- Convert each chunk to an Arrow Table using the type-specific serializer
- Write the table to a Parquet file at the computed path
Stage 5: File Naming
- Name each file as "{start_ns}-{end_ns}.parquet"
- start_ns and end_ns are the nanosecond Unix timestamps (ts_init) of the first and last objects in the chunk
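The classification, grouping, and naming stages above can be sketched in plain Python. `Tick`, `classify`, `group_chunks`, and `parquet_filename` are illustrative names for this sketch, not the actual NautilusTrader internals:

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Tick:
    """Hypothetical stand-in for a typed data object such as QuoteTick."""
    instrument_id: str
    ts_init: int  # nanosecond Unix timestamp


def classify(obj):
    # Stage 1: (class_name, identifier)
    return (type(obj).__name__, obj.instrument_id)


def group_chunks(data):
    # Stage 2: stable sort by (class_name, identifier) -- stability preserves
    # the caller's per-identifier ts_init ordering -- then split into
    # contiguous chunks of the same type and identifier
    keyed = sorted(data, key=classify)
    return {key: list(chunk) for key, chunk in groupby(keyed, key=classify)}


def parquet_filename(chunk):
    # Stage 5: "{start_ns}-{end_ns}.parquet" from the chunk's first/last ts_init
    return f"{chunk[0].ts_init}-{chunk[-1].ts_init}.parquet"
```

A heterogeneous input list thus fans out into one chunk per (class, identifier) pair, each destined for its own directory and interval-named file.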
Directory Path Construction
function make_path(data_cls, identifier):
    class_name = filename_from_class(data_cls)
    if identifier is not None:
        safe_id = uri_safe(identifier)  # Escape special characters
        return "{catalog_root}/{class_name}/{safe_id}/"
    else:
        return "{catalog_root}/{class_name}/"
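A minimal Python rendering of this pseudocode, assuming percent-encoding is an acceptable stand-in for the catalog's actual `uri_safe` escaping, and taking the class name directly rather than deriving it via `filename_from_class`:

```python
from typing import Optional
from urllib.parse import quote


def uri_safe(identifier: str) -> str:
    # Percent-encode characters that are unsafe in file paths (sketch; the
    # real escaping rules in NautilusTrader may differ)
    return quote(identifier, safe="")


def make_path(catalog_root: str, class_name: str, identifier: Optional[str]) -> str:
    if identifier is not None:
        return f"{catalog_root}/{class_name}/{uri_safe(identifier)}/"
    return f"{catalog_root}/{class_name}/"
```

Identifiers like "BTCUSDT.BINANCE" pass through unchanged, while characters such as "/" (as in FX pairs) are escaped so they cannot be misread as path separators.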
Monotonicity Invariant
For a sequence of data objects d_1, d_2, ..., d_n in a single chunk:
d_i.ts_init <= d_{i+1}.ts_init for all i in [1, n-1]
If this invariant is violated, the writer raises an error with guidance to sort the data before writing.
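The invariant check can be sketched as follows; this version operates on raw nanosecond timestamps rather than full data objects for simplicity:

```python
def assert_monotonic_ts_init(ts_inits) -> None:
    """Raise if ts_init values are not monotonically non-decreasing.

    Mirrors the writer's ordering invariant (sketch, not the actual
    NautilusTrader implementation).
    """
    for i, (prev, curr) in enumerate(zip(ts_inits, ts_inits[1:])):
        if curr < prev:
            raise ValueError(
                f"ts_init out of order at index {i + 1}: {curr} < {prev}; "
                "sort the data by ts_init before writing"
            )
```

Note that equal consecutive timestamps are allowed: the invariant is non-decreasing, not strictly increasing.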
Disjoint Interval Check
Given existing file intervals [s_1, e_1], [s_2, e_2], ... and a new interval [s_new, e_new]:
function are_intervals_disjoint(intervals):
    sort intervals by start time
    for each consecutive pair (a, b):
        if a.end >= b.start:
            return False  # Overlap detected
    return True
If the new interval would create an overlap, the writer raises a ValueError. This check can be skipped with skip_disjoint_check=True when the caller guarantees non-overlapping data.
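A direct Python translation of the pseudocode above, treating intervals as closed `(start_ns, end_ns)` tuples:

```python
def are_intervals_disjoint(intervals) -> bool:
    """Return True if the closed intervals do not overlap.

    Sorts by start time, then checks each consecutive pair; since the
    intervals are closed, a.end >= b.start means they overlap (sketch of
    the check described above).
    """
    ordered = sorted(intervals)
    return all(
        a_end < b_start
        for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])
    )
```

Because the intervals are closed, two files that merely touch at a boundary timestamp (e.g., [0, 5] and [5, 9]) already count as overlapping.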
Identifier Resolution
The identifier for each data object is resolved according to a priority hierarchy:
function resolve_identifier(obj):
    if obj is Instrument:
        return obj.id.value  # e.g., "BTCUSDT.BINANCE"
    elif obj has bar_type:
        return str(obj.bar_type)  # e.g., "BTCUSDT.BINANCE-1-MINUTE-LAST-EXTERNAL"
    elif obj has instrument_id:
        return obj.instrument_id.value  # e.g., "ETHUSDT.BINANCE"
    else:
        return None  # Custom data without instrument context
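The priority hierarchy can be sketched with duck typing. The classes below are hypothetical stand-ins that carry plain strings where the real NautilusTrader types hold `InstrumentId` and `BarType` objects:

```python
from dataclasses import dataclass


@dataclass
class Instrument:
    """Stand-in for an instrument definition (real type has obj.id.value)."""
    id: str


@dataclass
class Bar:
    """Stand-in for bar data keyed by bar_type."""
    bar_type: str


@dataclass
class QuoteTick:
    """Stand-in for market data keyed by instrument_id."""
    instrument_id: str


def resolve_identifier(obj):
    # Priority: instrument definition > bar_type > instrument_id > None
    if isinstance(obj, Instrument):
        return obj.id
    if hasattr(obj, "bar_type"):
        return str(obj.bar_type)
    if hasattr(obj, "instrument_id"):
        return obj.instrument_id
    return None  # Custom data without instrument context
```

Checking `bar_type` before `instrument_id` matters: bar objects also reference an instrument, but their catalog directory is keyed by the full bar type string.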