Principle:Nautechsystems Nautilus trader Data Configuration Schema

Field	Value
sources	https://github.com/nautechsystems/nautilus_trader, https://nautilustrader.io/docs/
domains	backtesting, data-management, configuration-management
last_updated	2026-02-10 12:00 GMT

Overview

A Data Configuration Schema is a declarative specification of which market data sets to load, how to filter them, and which time range to cover during a backtest simulation.

Description

A backtest engine requires historical market data -- quotes, trades, bars, order book snapshots -- to drive its simulation. Rather than having users write imperative data-loading code that manually opens catalogs, constructs queries, and feeds data into the engine, a data configuration schema captures all data-sourcing parameters in a single, serializable structure.

This approach solves several fundamental problems:

Decoupling data selection from engine logic -- The schema describes what data is needed (type, instrument, time range) while the engine handles how to load and replay it.
Catalog abstraction -- The schema references a data catalog by path and optional filesystem protocol, supporting local, S3, or other storage backends transparently.
Multi-data-type composition -- A single backtest can combine multiple data configuration schemas, each specifying a different data type (quotes, bars, order book deltas) for different instruments or time ranges.
Time windowing -- Start and end times allow precise control over the data window without modifying the underlying catalog.
Filtering -- Additional filter expressions (PyArrow-compatible) enable fine-grained selection beyond instrument and time.

Key dimensions of a data configuration schema:

Catalog location -- Path to the data catalog and optional filesystem protocol/storage options.
Data class -- The fully qualified type name of the data to load (e.g., the classes that represent quotes, trades, or bars).
Instrument identification -- One or more instrument IDs, or bar type specifications, to select from the catalog.
Time boundaries -- Start and end timestamps (ISO 8601 or UNIX nanoseconds) bounding the data window.
Filter expression -- An optional PyArrow-compatible filter for additional row-level filtering.
Client ID -- Optional client association for the data stream.
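
Because time boundaries may arrive in either of two notations, a loader would normally normalize them to a single representation before querying. The following is a minimal sketch of that step, not the library's implementation: `to_unix_nanos` is a hypothetical helper, and treating timezone-naive strings as UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional, Union

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_nanos(value: Union[str, int, None]) -> Optional[int]:
    """Normalize an ISO 8601 string or UNIX-nanosecond integer to int nanoseconds."""
    if value is None:
        return None  # open-ended boundary
    if isinstance(value, int):
        return value  # already UNIX nanoseconds
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    delta = parsed - EPOCH
    # Integer arithmetic avoids the precision loss of float timestamps.
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 1_000_000_000 + delta.microseconds * 1_000
```

Normalizing once at the boundary keeps every later comparison a plain integer comparison.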

Usage

Use the Data Configuration Schema principle whenever you need to:

Specify which historical data sets to load for a backtest run.
Combine multiple data types (bars, quotes, order book data) for the same or different instruments.
Apply time window boundaries to limit data to a specific period.
Work with data catalogs on different storage backends (local, S3, GCS).
Ensure data selection parameters are reproducible and version-controllable.
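
The reproducibility point can be illustrated with a round trip through JSON. This is a sketch under stated assumptions: `DataConfig`, `dump_configs`, and `load_configs` are hypothetical and simplified, the field names follow this page's pseudocode, and the catalog path and instrument values are purely illustrative.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataConfig:
    """Hypothetical, simplified schema (not the library's class)."""
    catalog_path: str
    data_cls: str
    instrument_id: Optional[str] = None
    start_time: Optional[str] = None
    end_time: Optional[str] = None


def dump_configs(configs: List[DataConfig]) -> str:
    """Serialize configs to deterministic JSON that diffs cleanly in version control."""
    return json.dumps([asdict(c) for c in configs], indent=2, sort_keys=True)


def load_configs(text: str) -> List[DataConfig]:
    """Rebuild the config objects from their JSON form."""
    return [DataConfig(**item) for item in json.loads(text)]


# Two data types for the same instrument and window (illustrative values).
configs = [
    DataConfig("/data/catalog", "QuoteTick", "EUR/USD.SIM", "2024-01-01", "2024-01-31"),
    DataConfig("/data/catalog", "Bar", "EUR/USD.SIM", "2024-01-01", "2024-01-31"),
]
restored = load_configs(dump_configs(configs))
```

Sorting keys makes the serialized form stable, so two runs with the same selection produce byte-identical files.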

Theoretical Basis

The Data Configuration Schema follows the query specification pattern, where a data request is described as a structured object that can be validated, composed, and executed later. This is analogous to building a database query as a data structure before execution.

Pseudocode for a generic data configuration schema:

DataConfig:
    catalog_path    : string                     # Path to data catalog
    data_cls        : string                     # Fully qualified data type name
    instrument_id   : InstrumentId | None        # Single instrument filter
    instrument_ids  : list[InstrumentId] | None  # Multi-instrument filter
    start_time      : string | int | None        # Start of data window (ISO 8601 or UNIX nanoseconds)
    end_time        : string | int | None        # End of data window (ISO 8601 or UNIX nanoseconds)
    filter_expr     : string | None              # Additional filter expression
    bar_spec        : string | None              # Bar specification (for bar data)
    bar_types       : list[string] | None        # Explicit bar type list
    client_id       : string | None              # Optional client association
    fs_protocol     : string | None              # Filesystem protocol
    fs_options      : map | None                 # Filesystem storage options
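
As a rough illustration of the "validated" part of the query specification pattern, the schema can be written as a frozen Python dataclass that checks itself on construction. This is a sketch, not the library's class: instrument IDs are modeled as plain strings, timestamps as ISO 8601 strings or integer nanoseconds, and every validation rule below is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class DataConfig:
    """Hypothetical Python rendering of the generic schema."""
    catalog_path: str
    data_cls: str
    instrument_id: Optional[str] = None
    instrument_ids: Optional[List[str]] = None
    start_time: Union[str, int, None] = None
    end_time: Union[str, int, None] = None
    filter_expr: Optional[str] = None
    bar_spec: Optional[str] = None
    bar_types: Optional[List[str]] = None
    fs_protocol: Optional[str] = None
    fs_options: Optional[Dict[str, str]] = None

    def __post_init__(self) -> None:
        if not self.catalog_path:
            raise ValueError("catalog_path must not be empty")
        if not self.data_cls:
            raise ValueError("data_cls must not be empty")
        # Assumed rule: the two instrument selectors are alternatives.
        if self.instrument_id is not None and self.instrument_ids is not None:
            raise ValueError("set instrument_id or instrument_ids, not both")
        # Bounds are only compared when both are integer nanoseconds.
        if (
            isinstance(self.start_time, int)
            and isinstance(self.end_time, int)
            and self.start_time > self.end_time
        ):
            raise ValueError("start_time must not be after end_time")
```

Failing at construction time means an invalid data selection is rejected before any catalog is opened.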

Query construction pseudocode:

BUILD_QUERY(config):
    identifiers = []

    IF resolve(config.data_cls) is Bar:
        IF config.bar_types is not None:
            identifiers = config.bar_types
        ELSE IF config.instrument_id AND config.bar_spec:
            identifiers = ["{instrument_id}-{bar_spec}-EXTERNAL"]
        ELSE IF config.instrument_ids AND config.bar_spec:
            FOR EACH id IN config.instrument_ids:
                identifiers.append("{id}-{bar_spec}-EXTERNAL")

    IF identifiers is empty:
        IF config.instrument_id:
            identifiers = [config.instrument_id]
        ELSE IF config.instrument_ids:
            identifiers = config.instrument_ids

    RETURN Query(
        data_cls    = resolve(config.data_cls),
        identifiers = identifiers,
        start       = config.start_time,
        end         = config.end_time,
        filter_expr = parse(config.filter_expr) IF config.filter_expr is not None ELSE None,
    )
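
The query construction logic above translates fairly directly into Python. The version below is a sketch under assumptions: `DataConfig`, `Query`, `is_bar`, and `build_query` are hypothetical names, instrument IDs are plain strings, bar data is detected from the last component of the type name, and the filter expression is passed through unparsed.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class DataConfig:
    data_cls: str
    instrument_id: Optional[str] = None
    instrument_ids: Optional[List[str]] = None
    start_time: Union[str, int, None] = None
    end_time: Union[str, int, None] = None
    filter_expr: Optional[str] = None
    bar_spec: Optional[str] = None
    bar_types: Optional[List[str]] = None


@dataclass
class Query:
    data_cls: str
    identifiers: List[str]
    start: Union[str, int, None]
    end: Union[str, int, None]
    filter_expr: Optional[str]


def is_bar(data_cls: str) -> bool:
    # Assumption: the final component of the type name identifies bar data.
    return data_cls.replace(":", ".").rsplit(".", 1)[-1] == "Bar"


def build_query(config: DataConfig) -> Query:
    identifiers: List[str] = []
    if is_bar(config.data_cls):
        if config.bar_types is not None:
            identifiers = list(config.bar_types)
        elif config.instrument_id and config.bar_spec:
            identifiers = [f"{config.instrument_id}-{config.bar_spec}-EXTERNAL"]
        elif config.instrument_ids and config.bar_spec:
            identifiers = [f"{i}-{config.bar_spec}-EXTERNAL" for i in config.instrument_ids]
    # Fall back to plain instrument IDs for non-bar data or incomplete bar settings.
    if not identifiers:
        if config.instrument_id:
            identifiers = [config.instrument_id]
        elif config.instrument_ids:
            identifiers = list(config.instrument_ids)
    return Query(config.data_cls, identifiers, config.start_time,
                 config.end_time, config.filter_expr)
```

Note the fallback branch: a bar configuration without a bar specification still yields a query keyed on the instrument IDs rather than an empty identifier list.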

The schema is consumed by a data loading layer that translates it into catalog queries, resolves the data type, and streams the results into the backtest engine.
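
To make the loading step concrete, here is a toy sketch of how such a layer might select and replay records. Everything in it is an assumption for illustration: the catalog is an in-memory list of dictionaries with hypothetical `identifier` and `ts` keys, both time bounds are treated as inclusive, and `stream_records` is not a real library function.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional

Record = Dict[str, Any]


def stream_records(
    catalog: Iterable[Record],
    identifiers: List[str],
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> Iterator[Record]:
    """Yield records matching the identifiers and time window, oldest first."""
    wanted = set(identifiers)
    selected = [
        record
        for record in catalog
        if record["identifier"] in wanted
        and (start is None or record["ts"] >= start)  # assumed inclusive bound
        and (end is None or record["ts"] <= end)      # assumed inclusive bound
    ]
    # Replay in timestamp order so the engine sees events chronologically.
    yield from sorted(selected, key=lambda record: record["ts"])
```

A real loader would push the identifier and time predicates down into the storage layer instead of filtering in memory, but the contract is the same: a query goes in and an ordered stream comes out.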
