Principle: NautilusTrader (nautechsystems) External Data Loading
| Field | Value |
|---|---|
| Sources | GitHub Repository, NautilusTrader Documentation |
| Domains | Market Data Ingestion, Data Formats, ETL Pipelines |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
External data loading is the process of ingesting market data from third-party providers and proprietary file formats into a normalized, typed object model suitable for backtesting, analysis, and live trading.
Description
Financial market data originates from a variety of vendors, exchanges, and data providers, each with their own proprietary file formats, encoding schemes, and schema conventions. Before this data can be used in a trading system, it must be decoded, validated, mapped to a canonical type system, and optionally converted between internal representations. External data loading addresses this translation layer.
The key challenges that external data loading solves include:
- Format heterogeneity -- Vendors deliver data in binary formats (e.g., Databento's DBN encoding), CSV files (e.g., Tardis), proprietary APIs, or database dumps. Each requires a dedicated parser.
- Schema mapping -- Raw vendor schemas (MBO, MBP-1, MBP-10, TRADES, OHLCV, DEFINITION, etc.) must be mapped to the system's canonical data types (OrderBookDelta, QuoteTick, TradeTick, Bar, Instrument, etc.).
- Symbology translation -- Vendor-specific instrument identifiers must be mapped to the system's internal InstrumentId format (symbol + venue).
- Precision handling -- Price and size precision must be correctly inferred or specified to avoid floating-point artifacts.
- Performance -- Loading large historical datasets (millions of rows) requires efficient deserialization, ideally leveraging compiled code paths (Rust/Cython) rather than pure Python.
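The precision pitfall above is easy to reproduce. The sketch below (a minimal illustration, not a NautilusTrader API; `price_to_raw` is a hypothetical helper) converts a textual price into a fixed-point integer via `decimal.Decimal`, sidestepping the binary floating-point drift that a naive `float` round-trip can introduce:

```python
from decimal import Decimal

def price_to_raw(price_str: str, precision: int) -> int:
    """Convert a textual price to a fixed-point integer at the given precision."""
    # Scaling via Decimal is exact for decimal inputs, unlike float(),
    # which rounds to the nearest binary double before scaling.
    return int(Decimal(price_str).scaleb(precision))

# Naive float round-trip: 0.29 * 100 evaluates to 28.999999999999996,
# so truncation lands on the wrong integer.
naive = int(float("0.29") * 10**2)          # 28 (float artifact)

# Decimal path is exact.
exact = price_to_raw("0.29", precision=2)   # 29
```

This is why loaders typically either read precision from an encoding spec (binary formats) or let the caller pin it explicitly, rather than inferring it through floats.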
Usage
External data loading is used in the following contexts:
- Historical data preparation -- Converting vendor-supplied data files into typed objects before writing them to a data catalog.
- Backtesting pipelines -- Loading data files directly for immediate consumption by a backtest engine.
- Data validation -- Inspecting loaded data objects to verify correctness of symbology, precision, and timestamp ordering.
- Multi-vendor workflows -- Combining data from multiple providers (e.g., Databento for US equities, Tardis for crypto) into a unified catalog.
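A multi-vendor workflow ultimately reduces to merging per-vendor record streams into one time-ordered stream. The sketch below uses a minimal stand-in record type (field names are illustrative, not the NautilusTrader `TradeTick`) and `heapq.merge`, which assumes each input stream is already ordered, as loaders typically guarantee per file:

```python
from dataclasses import dataclass
from heapq import merge

@dataclass(frozen=True)
class TradeTick:
    # Minimal stand-in for a canonical trade object (illustrative fields).
    instrument_id: str
    price_raw: int
    ts_init: int

def combine_streams(*streams):
    """Merge per-vendor, time-ordered record streams into one ordered stream."""
    return list(merge(*streams, key=lambda r: r.ts_init))

databento = [TradeTick("AAPL.XNAS", 19850, 1), TradeTick("AAPL.XNAS", 19860, 5)]
tardis = [TradeTick("BTCUSDT.BINANCE", 6400000, 3)]
combined = combine_streams(databento, tardis)
```

Once merged, the unified stream can be written to a single catalog regardless of which vendor each record came from.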
Theoretical Basis
Data Loading Pipeline
The general architecture for external data loading follows a staged pipeline:
Stage 1: File Discovery
- Locate data files on disk or remote storage
- Determine file format (DBN, CSV, JSON, etc.)
Stage 2: Schema Detection
- Inspect file headers or metadata to determine the data schema
- Map vendor schema to canonical data type(s)
Stage 3: Deserialization
- Decode binary/text records into structured objects
- Apply precision settings for price and size fields
- Map vendor instrument symbols to internal InstrumentIds
Stage 4: Type Conversion (optional)
- Convert between internal representations (e.g., Rust pyo3 objects to Cython objects)
- Merge related data streams (e.g., MBP-1 produces both QuoteTick and optional TradeTick)
Stage 5: Output
- Return a list of typed Data objects ready for catalog ingestion or direct use
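The five stages above can be sketched as a single driver function. This is a simplified illustration with hypothetical names (`detect_format`, `load_file`), handling only CSV and skipping the optional type-conversion stage:

```python
import csv
import io
from pathlib import Path

def detect_format(path: str) -> str:
    # Stage 1: file discovery / format detection by extension (simplified).
    return Path(path).suffix.lstrip(".").lower()

def load_file(path: str, text: str):
    """Hypothetical end-to-end loader illustrating the staged pipeline."""
    fmt = detect_format(path)                  # Stage 1
    if fmt != "csv":
        raise ValueError(f"unsupported format: {fmt}")
    reader = csv.DictReader(io.StringIO(text))
    schema = tuple(reader.fieldnames)          # Stage 2: header row defines schema
    records = [
        # Stage 3: deserialize rows into structured, typed records.
        {"symbol": row["symbol"], "price": float(row["price"]), "ts": int(row["ts"])}
        for row in reader
    ]
    return schema, records                     # Stage 5: typed output
```

A real loader would return canonical typed objects rather than dicts, and Stage 4 would convert between internal representations where required.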
Schema-to-Type Mapping
A core component of external data loading is the mapping table from vendor schemas to internal types:
| Vendor Schema | Internal Type(s) |
|---|---|
| MBO | OrderBookDelta |
| MBP_1 / TBBO | QuoteTick (+ optional TradeTick) |
| MBP_10 | OrderBookDepth10 |
| BBO_1S / BBO_1M | QuoteTick |
| TRADES | TradeTick |
| OHLCV_1S/1M/1H/1D | Bar |
| DEFINITION | Instrument (subtype varies) |
| STATUS | InstrumentStatus |
| IMBALANCE | Vendor-specific type |
| STATISTICS | Vendor-specific type |
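In code, such a mapping is naturally a lookup table. The dict below mirrors the table above for illustration (names are type names, not imports; `canonical_types` is a hypothetical helper, and vendor-specific schemas are omitted):

```python
# Illustrative mapping from vendor schema names to canonical type name(s).
SCHEMA_TO_TYPES = {
    "MBO": ("OrderBookDelta",),
    "MBP_1": ("QuoteTick", "TradeTick"),  # TradeTick only when a trade is present
    "TBBO": ("QuoteTick", "TradeTick"),
    "MBP_10": ("OrderBookDepth10",),
    "BBO_1S": ("QuoteTick",),
    "BBO_1M": ("QuoteTick",),
    "TRADES": ("TradeTick",),
    "OHLCV_1S": ("Bar",),
    "OHLCV_1M": ("Bar",),
    "OHLCV_1H": ("Bar",),
    "OHLCV_1D": ("Bar",),
    "DEFINITION": ("Instrument",),  # concrete subtype varies by asset class
    "STATUS": ("InstrumentStatus",),
}

def canonical_types(schema: str) -> tuple:
    """Look up the canonical output type(s) for a vendor schema name."""
    try:
        return SCHEMA_TO_TYPES[schema]
    except KeyError:
        raise ValueError(f"unmapped vendor schema: {schema}") from None
```

Failing loudly on an unmapped schema is preferable to silently dropping records during ingestion.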
Precision and Symbology
```python
def load_data(path, instrument_id=None, price_precision=None):
    schema = detect_schema(path)         # Stage 2: schema detection
    records = decode_file(path, schema)  # Stage 3: deserialization
    for record in records:
        if instrument_id is not None:
            record.instrument_id = instrument_id      # Override symbology
        if price_precision is not None:
            record.apply_precision(price_precision)   # Override precision
    return records
```
When an explicit instrument_id is provided, it overrides the vendor symbology for every record in the file, bypassing per-record symbol lookup. This is an optimization for single-instrument files whose identity is known in advance.
CSV vs. Binary Format Trade-offs
| Property | CSV (e.g., Tardis) | Binary (e.g., Databento DBN) |
|---|---|---|
| Parse speed | Slower (text parsing) | Faster (direct memory mapping) |
| File size | Larger | Smaller (compressed binary) |
| Schema flexibility | Column headers define schema | Embedded metadata header |
| Precision | Inferred from string representation | Fixed by encoding spec |
| Streaming | Line-by-line | Chunk-by-chunk |
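The trade-offs above can be made concrete with a toy comparison. The record layout below is illustrative only (it is not the DBN wire format): fixed-width binary records decode with `struct` and no text scanning, while the CSV path must parse each field from text against a header-defined schema:

```python
import csv
import io
import struct

# One fixed-width binary record: (price_raw: int64, size_raw: int64, ts: uint64).
RECORD = struct.Struct("<qqQ")

def parse_binary(buf: bytes):
    # Fixed-width records decode by offset with no text scanning, which is
    # why binary formats parse faster and stream naturally chunk-by-chunk.
    return [RECORD.unpack_from(buf, i) for i in range(0, len(buf), RECORD.size)]

def parse_csv(text: str):
    # CSV requires per-field text parsing; the header row defines the schema.
    return [(int(r["price_raw"]), int(r["size_raw"]), int(r["ts"]))
            for r in csv.DictReader(io.StringIO(text))]

binary = RECORD.pack(19850, 100, 1) + RECORD.pack(19860, 50, 2)
text = "price_raw,size_raw,ts\n19850,100,1\n19860,50,2\n"
assert parse_binary(binary) == parse_csv(text)  # same records, different encodings
```

Note the size difference as well: each binary record here is a fixed 24 bytes, while the equivalent CSV row grows with the digit count of its values.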