Principle: NautilusTrader Catalog Data Writing
| Field | Value |
|---|---|
| Sources | GitHub Repository, NautilusTrader Documentation |
| Domains | Data Persistence, Columnar Storage, Apache Arrow, Time-Series Data |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Catalog data writing is the process of persisting typed trading data objects -- market data, instrument definitions, and events -- to a columnar storage backend in an organized, time-partitioned, and type-safe manner.
Description
After market data has been loaded from external sources and normalized into typed objects, it must be written to persistent storage for future retrieval. Catalog data writing handles the serialization of in-memory data objects to Apache Parquet files, organized by data type and instrument identity, with strict guarantees about temporal ordering and interval disjointness.
The key problems that catalog data writing solves include:
- Type-aware serialization -- Each data type (QuoteTick, TradeTick, Bar, Instrument, etc.) has a distinct Arrow schema. The writer must select the correct serializer and directory path for each type.
- Instrument partitioning -- Data for different instruments is written to separate subdirectories, enabling efficient per-instrument queries without scanning unrelated data.
- Temporal ordering enforcement -- All data within a single write operation must be monotonically non-decreasing by initialization timestamp (ts_init). This invariant ensures that downstream consumers can rely on sorted data.
- Disjoint interval validation -- Each Parquet file covers a specific time interval encoded in its filename. The writer validates that new files do not create overlapping intervals with existing files, preventing duplicate or ambiguous data.
- Automatic grouping -- A heterogeneous list of data objects is automatically sorted, grouped by type and identifier, and written as separate chunks to the appropriate locations.
Usage
Catalog data writing is used in the following scenarios:
- Data pipeline ingestion -- After loading data from Databento DBN files, Tardis CSV files, or other sources, write it to the catalog for persistent storage.
- Incremental updates -- Append new time-range chunks to an existing catalog without overwriting or duplicating data.
- Multi-instrument catalogs -- Write data for many instruments in a single call; the writer automatically routes each instrument's data to the correct subdirectory.
- Backtest data preparation -- Build a complete catalog of historical data that the backtesting engine can query efficiently.
Theoretical Basis
Write Pipeline
The data writing process follows a multi-stage pipeline:
Stage 1: Classification
- For each data object, determine:
  (a) the class name (e.g., "QuoteTick", "Bar", "CurrencyPair")
  (b) the identifier (instrument_id, bar_type, or None for custom data)
Stage 2: Sorting and Grouping
- Sort all objects by (class_name, identifier)
- Group into contiguous chunks of the same type and identifier
Stage 3: Validation
- For each chunk, verify monotonic non-decreasing ts_init ordering
- Check that the new time interval is disjoint from existing files
Stage 4: Serialization
- Convert each chunk to an Arrow Table using the type-specific serializer
- Write the table to a Parquet file at the computed path
Stage 5: File Naming
- Name each file as "{start_ns}-{end_ns}.parquet"
- start_ns and end_ns are the nanosecond Unix timestamps (ts_init) of the first and last objects in the chunk
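The classification, grouping, and naming stages above can be sketched in plain Python. `Tick`, `classify`, `group_chunks`, and `parquet_filename` are illustrative names for this sketch, not the actual NautilusTrader internals:

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Tick:
    """Hypothetical stand-in for a typed data object such as QuoteTick."""
    instrument_id: str
    ts_init: int  # nanosecond Unix timestamp


def classify(obj):
    # Stage 1: (class_name, identifier)
    return (type(obj).__name__, obj.instrument_id)


def group_chunks(data):
    # Stage 2: stable sort by (class_name, identifier) -- stability preserves
    # the caller's per-identifier ts_init ordering -- then split into
    # contiguous chunks of the same type and identifier
    keyed = sorted(data, key=classify)
    return {key: list(chunk) for key, chunk in groupby(keyed, key=classify)}


def parquet_filename(chunk):
    # Stage 5: "{start_ns}-{end_ns}.parquet" from the chunk's first/last ts_init
    return f"{chunk[0].ts_init}-{chunk[-1].ts_init}.parquet"
```

A heterogeneous input list thus fans out into one chunk per (class, identifier) pair, each destined for its own directory and interval-named file.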
Directory Path Construction
function make_path(data_cls, identifier):
    class_name = filename_from_class(data_cls)
    if identifier is not None:
        safe_id = uri_safe(identifier)  # Escape special characters
        return "{catalog_root}/{class_name}/{safe_id}/"
    else:
        return "{catalog_root}/{class_name}/"
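A minimal Python rendering of this pseudocode, assuming percent-encoding is an acceptable stand-in for the catalog's actual `uri_safe` escaping, and taking the class name directly rather than deriving it via `filename_from_class`:

```python
from typing import Optional
from urllib.parse import quote


def uri_safe(identifier: str) -> str:
    # Percent-encode characters that are unsafe in file paths (sketch; the
    # real escaping rules in NautilusTrader may differ)
    return quote(identifier, safe="")


def make_path(catalog_root: str, class_name: str, identifier: Optional[str]) -> str:
    if identifier is not None:
        return f"{catalog_root}/{class_name}/{uri_safe(identifier)}/"
    return f"{catalog_root}/{class_name}/"
```

Identifiers like "BTCUSDT.BINANCE" pass through unchanged, while characters such as "/" (as in FX pairs) are escaped so they cannot be misread as path separators.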
Monotonicity Invariant
For a sequence of data objects d_1, d_2, ..., d_n in a single chunk:
d_i.ts_init <= d_{i+1}.ts_init for all i in [1, n-1]
If this invariant is violated, the writer raises an error with guidance to sort the data before writing.
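The invariant check can be sketched as follows; this version operates on raw nanosecond timestamps rather than full data objects for simplicity:

```python
def assert_monotonic_ts_init(ts_inits) -> None:
    """Raise if ts_init values are not monotonically non-decreasing.

    Mirrors the writer's ordering invariant (sketch, not the actual
    NautilusTrader implementation).
    """
    for i, (prev, curr) in enumerate(zip(ts_inits, ts_inits[1:])):
        if curr < prev:
            raise ValueError(
                f"ts_init out of order at index {i + 1}: {curr} < {prev}; "
                "sort the data by ts_init before writing"
            )
```

Note that equal consecutive timestamps are allowed: the invariant is non-decreasing, not strictly increasing.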
Disjoint Interval Check
Given existing file intervals [s_1, e_1], [s_2, e_2], ... and a new interval [s_new, e_new]:
function are_intervals_disjoint(intervals):
    sort intervals by start time
    for each consecutive pair (a, b):
        if a.end >= b.start:
            return False  # Overlap detected
    return True
If the new interval would create an overlap, the writer raises a ValueError. This check can be skipped with skip_disjoint_check=True when the caller guarantees non-overlapping data.
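A direct Python translation of the pseudocode above, treating intervals as closed `(start_ns, end_ns)` tuples:

```python
def are_intervals_disjoint(intervals) -> bool:
    """Return True if the closed intervals do not overlap.

    Sorts by start time, then checks each consecutive pair; since the
    intervals are closed, a.end >= b.start means they overlap (sketch of
    the check described above).
    """
    ordered = sorted(intervals)
    return all(
        a_end < b_start
        for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])
    )
```

Because the intervals are closed, two files that merely touch at a boundary timestamp (e.g., [0, 5] and [5, 9]) already count as overlapping.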
Identifier Resolution
The identifier for each data object is resolved according to a priority hierarchy:
function resolve_identifier(obj):
    if obj is Instrument:
        return obj.id.value  # e.g., "BTCUSDT.BINANCE"
    elif obj has bar_type:
        return str(obj.bar_type)  # e.g., "BTCUSDT.BINANCE-1-MINUTE-LAST-EXTERNAL"
    elif obj has instrument_id:
        return obj.instrument_id.value  # e.g., "ETHUSDT.BINANCE"
    else:
        return None  # Custom data without instrument context
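The priority hierarchy can be sketched with duck typing. The classes below are hypothetical stand-ins that carry plain strings where the real NautilusTrader types hold `InstrumentId` and `BarType` objects:

```python
from dataclasses import dataclass


@dataclass
class Instrument:
    """Stand-in for an instrument definition (real type has obj.id.value)."""
    id: str


@dataclass
class Bar:
    """Stand-in for bar data keyed by bar_type."""
    bar_type: str


@dataclass
class QuoteTick:
    """Stand-in for market data keyed by instrument_id."""
    instrument_id: str


def resolve_identifier(obj):
    # Priority: instrument definition > bar_type > instrument_id > None
    if isinstance(obj, Instrument):
        return obj.id
    if hasattr(obj, "bar_type"):
        return str(obj.bar_type)
    if hasattr(obj, "instrument_id"):
        return obj.instrument_id
    return None  # Custom data without instrument context
```

Checking `bar_type` before `instrument_id` matters: bar objects also reference an instrument, but their catalog directory is keyed by the full bar type string.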