Principle:Nautechsystems Nautilus trader Data Configuration Schema
| Field | Value |
|---|---|
| sources | https://github.com/nautechsystems/nautilus_trader, https://nautilustrader.io/docs/ |
| domains | backtesting, data-management, configuration-management |
| last_updated | 2026-02-10 12:00 GMT |
Overview
A Data Configuration Schema is a declarative specification of which market data sets to load, how to filter them, and which time range they cover during a backtest simulation.
Description
A backtest engine requires historical market data -- quotes, trades, bars, order book snapshots -- to drive its simulation. Rather than having users write imperative data-loading code that manually opens catalogs, constructs queries, and feeds data into the engine, a data configuration schema captures all data-sourcing parameters in a single, serializable structure.
This approach solves several fundamental problems:
- Decoupling data selection from engine logic -- The schema describes what data is needed (type, instrument, time range) while the engine handles how to load and replay it.
- Catalog abstraction -- The schema references a data catalog by path and optional filesystem protocol, supporting local, S3, or other storage backends transparently.
- Multi-data-type composition -- A single backtest can combine multiple data configuration schemas, each specifying a different data type (quotes, bars, order book deltas) for different instruments or time ranges.
- Time windowing -- Start and end times allow precise control over the data window without modifying the underlying catalog.
- Filtering -- Additional filter expressions (PyArrow-compatible) enable fine-grained selection beyond instrument and time.
Key dimensions of a data configuration schema:
- Catalog location -- Path to the data catalog and optional filesystem protocol/storage options.
- Data class -- The fully qualified type name of the data to load, identifying the kind of data (e.g., quotes, trades, bars).
- Instrument identification -- One or more instrument IDs, or bar type specifications, to select from the catalog.
- Time boundaries -- Start and end timestamps (ISO 8601 or UNIX nanoseconds) bounding the data window.
- Filter expression -- An optional PyArrow-compatible filter for additional row-level filtering.
- Client ID -- Optional client association for the data stream.
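Because time boundaries may arrive either as ISO 8601 strings or as UNIX nanoseconds, a loader typically normalizes them to one representation before querying. The following is a minimal sketch of that step, not the library's implementation: the helper names `to_unix_nanos` and `in_window` are hypothetical, naive timestamps are assumed to be UTC, and the window is assumed to be inclusive at both ends.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_nanos(value):
    """Normalize a time boundary to UNIX nanoseconds (None means unbounded)."""
    if value is None or isinstance(value, int):
        return value  # already nanoseconds, or an open-ended boundary
    text = value.replace("Z", "+00:00")  # older fromisoformat() rejects a "Z" suffix
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    # Integer arithmetic avoids float rounding; datetime resolution is 1 microsecond.
    return ((dt - _EPOCH) // timedelta(microseconds=1)) * 1_000


def in_window(ts_ns, start=None, end=None):
    """Return True if ts_ns lies in the window (assumed inclusive on both ends)."""
    lo, hi = to_unix_nanos(start), to_unix_nanos(end)
    return (lo is None or ts_ns >= lo) and (hi is None or ts_ns <= hi)
```

Normalizing once at the boundary keeps the rest of the loading code free of format checks, and leaving `None` untouched preserves the meaning of an open-ended window.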
Usage
Use the Data Configuration Schema principle whenever you need to:
- Specify which historical data sets to load for a backtest run.
- Combine multiple data types (bars, quotes, order book data) for the same or different instruments.
- Apply time window boundaries to limit data to a specific period.
- Work with data catalogs on different storage backends (local, S3, GCS).
- Ensure data selection parameters are reproducible and version-controllable.
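To illustrate the reproducibility point, the sketch below serializes a configuration to canonical JSON, restores it, and derives a stable identifier from its content. It assumes a simplified, hypothetical `DataConfigSketch` class with only a few fields; the class, the helper functions, and the sample values (including the `data_cls` string) are illustrative and are not taken from the library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DataConfigSketch:
    """Hypothetical, reduced stand-in for a data configuration schema."""
    catalog_path: str
    data_cls: str
    instrument_id: Optional[str] = None
    start_time: Optional[str] = None
    end_time: Optional[str] = None
    filter_expr: Optional[str] = None


def to_json(config):
    # sort_keys gives a canonical form, so equal configs serialize identically.
    return json.dumps(asdict(config), sort_keys=True)


def from_json(text):
    return DataConfigSketch(**json.loads(text))


def config_id(config):
    # Content hash: changes whenever any data-selection parameter changes.
    return hashlib.sha256(to_json(config).encode("utf-8")).hexdigest()


config = DataConfigSketch(
    catalog_path="/data/catalog",        # illustrative path
    data_cls="my_package.data:QuoteTick",  # hypothetical type name
    instrument_id="AAA.SIM",
    start_time="2024-01-01T00:00:00+00:00",
    end_time="2024-01-31T23:59:59+00:00",
)
restored = from_json(to_json(config))
```

The JSON text can be committed to version control, and the content hash can be recorded alongside backtest results to tie each run to the exact data selection that produced it.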
Theoretical Basis
The Data Configuration Schema follows the query specification pattern, where a data request is described as a structured object that can be validated, composed, and executed later. This is analogous to building a database query as a data structure before execution.
Pseudocode for a generic data configuration schema:
    DataConfig:
        catalog_path   : string                      # Path to data catalog
        data_cls       : string                      # Fully qualified data type name
        instrument_id  : InstrumentId | None         # Single instrument filter
        instrument_ids : list[InstrumentId] | None   # Multi-instrument filter
        start_time     : datetime | int | None       # Start of data window (ISO 8601 or UNIX nanoseconds)
        end_time       : datetime | int | None       # End of data window (ISO 8601 or UNIX nanoseconds)
        filter_expr    : string | None               # Additional PyArrow-compatible filter expression
        bar_spec       : string | None               # Bar specification (for bar data)
        bar_types      : list[string] | None         # Explicit bar type list
        client_id      : string | None               # Optional client association
        fs_protocol    : string | None               # Filesystem protocol
        fs_options     : map | None                  # Filesystem storage options
Query construction pseudocode:
    BUILD_QUERY(config):
        identifiers = []
        IF config.data_cls is Bar:
            IF config.bar_types is not None:
                identifiers = config.bar_types
            ELSE IF config.instrument_id AND config.bar_spec:
                identifiers = ["{instrument_id}-{bar_spec}-EXTERNAL"]
            ELSE IF config.instrument_ids AND config.bar_spec:
                FOR EACH id IN config.instrument_ids:
                    identifiers.append("{id}-{bar_spec}-EXTERNAL")
        IF identifiers is empty:
            IF config.instrument_id:
                identifiers = [config.instrument_id]
            ELSE IF config.instrument_ids:
                identifiers = config.instrument_ids
        RETURN Query(
            data_cls    = resolve(config.data_cls),
            identifiers = identifiers,
            start       = config.start_time,
            end         = config.end_time,
            filter_expr = parse(config.filter_expr),
        )
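The query construction pseudocode can be sketched in runnable Python as follows. This is an illustration under stated assumptions rather than the library's code: `DataConfigSketch` and `build_query` are hypothetical names, the data type is compared as a plain string instead of being resolved to a class, the filter expression is passed through unparsed, and the query is returned as a dictionary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataConfigSketch:
    """Hypothetical subset of the schema fields needed to build a query."""
    data_cls: str
    instrument_id: Optional[str] = None
    instrument_ids: Optional[List[str]] = None
    bar_spec: Optional[str] = None
    bar_types: Optional[List[str]] = None
    start_time: Optional[str] = None
    end_time: Optional[str] = None
    filter_expr: Optional[str] = None


def build_query(config):
    identifiers = []
    if config.data_cls == "Bar":
        if config.bar_types is not None:
            # Explicit bar types take precedence over derived ones.
            identifiers = list(config.bar_types)
        elif config.instrument_id and config.bar_spec:
            identifiers = [f"{config.instrument_id}-{config.bar_spec}-EXTERNAL"]
        elif config.instrument_ids and config.bar_spec:
            identifiers = [
                f"{i}-{config.bar_spec}-EXTERNAL" for i in config.instrument_ids
            ]
    if not identifiers:
        # Non-bar data, or bar data without a usable bar specification.
        if config.instrument_id:
            identifiers = [config.instrument_id]
        elif config.instrument_ids:
            identifiers = list(config.instrument_ids)
    return {
        "data_cls": config.data_cls,        # simplification: not resolved to a class
        "identifiers": identifiers,
        "start": config.start_time,
        "end": config.end_time,
        "filter_expr": config.filter_expr,  # simplification: left unparsed
    }


bars = build_query(DataConfigSketch(
    data_cls="Bar", instrument_ids=["AAA.SIM", "BBB.SIM"], bar_spec="1-MINUTE-LAST",
))
quotes = build_query(DataConfigSketch(data_cls="QuoteTick", instrument_id="AAA.SIM"))
```

Here `bars["identifiers"]` holds one derived bar type per instrument, while `quotes["identifiers"]` falls back to the plain instrument ID because the bar branch does not apply.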
The schema is consumed by a data loading layer that translates it into catalog queries, resolves the data type, and streams the results into the backtest engine.
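One way to picture such a loading layer is a function that runs each query against the catalog and merges the per-identifier streams into a single time-ordered sequence. The sketch below assumes a hypothetical in-memory `CATALOG` dictionary in place of a real data catalog, integer timestamps, and records already sorted by timestamp within each source; `scan` and `stream` are illustrative names, not library functions.

```python
import heapq
from operator import itemgetter

# Hypothetical in-memory "catalog": (data type, identifier) -> (timestamp, payload)
# records, each list already sorted by timestamp.
CATALOG = {
    ("QuoteTick", "AAA.SIM"): [(1, "quote-1"), (4, "quote-2"), (9, "quote-3")],
    ("TradeTick", "AAA.SIM"): [(2, "trade-1"), (5, "trade-2")],
}


def scan(data_cls, identifier, start, end):
    """Lazily yield the records of one source that fall inside [start, end]."""
    for ts, payload in CATALOG.get((data_cls, identifier), []):
        if (start is None or ts >= start) and (end is None or ts <= end):
            yield ts, payload


def stream(queries):
    """Merge all sources from all queries into one timestamp-ordered iterator."""
    sources = [
        scan(q["data_cls"], ident, q.get("start"), q.get("end"))
        for q in queries
        for ident in q["identifiers"]
    ]
    # heapq.merge requires each input to be sorted, which the catalog guarantees here.
    return heapq.merge(*sources, key=itemgetter(0))


events = list(stream([
    {"data_cls": "QuoteTick", "identifiers": ["AAA.SIM"], "start": None, "end": 8},
    {"data_cls": "TradeTick", "identifiers": ["AAA.SIM"], "start": 2, "end": None},
]))
```

Merging lazily means the replay loop only ever holds one pending record per source, which is why each identifier gets its own generator instead of being concatenated into a single list.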