Principle: NautilusTrader (nautechsystems) External Data Loading
| Field | Value |
|---|---|
| Sources | GitHub Repository, NautilusTrader Documentation |
| Domains | Market Data Ingestion, Data Formats, ETL Pipelines |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
External data loading is the process of ingesting market data from third-party providers and proprietary file formats into a normalized, typed object model suitable for backtesting, analysis, and live trading.
Description
Financial market data originates from a variety of vendors, exchanges, and data providers, each with their own proprietary file formats, encoding schemes, and schema conventions. Before this data can be used in a trading system, it must be decoded, validated, mapped to a canonical type system, and optionally converted between internal representations. External data loading addresses this translation layer.
The key challenges that external data loading solves include:
- Format heterogeneity -- Vendors deliver data in binary formats (e.g., Databento's DBN encoding), CSV files (e.g., Tardis), proprietary APIs, or database dumps. Each requires a dedicated parser.
- Schema mapping -- Raw vendor schemas (MBO, MBP-1, MBP-10, TRADES, OHLCV, DEFINITION, etc.) must be mapped to the system's canonical data types (OrderBookDelta, QuoteTick, TradeTick, Bar, Instrument, etc.).
- Symbology translation -- Vendor-specific instrument identifiers must be mapped to the system's internal InstrumentId format (symbol + venue).
- Precision handling -- Price and size precision must be correctly inferred or specified to avoid floating-point artifacts.
- Performance -- Loading large historical datasets (millions of rows) requires efficient deserialization, ideally leveraging compiled code paths (Rust/Cython) rather than pure Python.
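The precision pitfall above is easy to reproduce. The sketch below (a minimal illustration, not a NautilusTrader API; `price_to_raw` is a hypothetical helper) converts a textual price into a fixed-point integer via `decimal.Decimal`, sidestepping the binary floating-point drift that a naive `float` round-trip can introduce:

```python
from decimal import Decimal

def price_to_raw(price_str: str, precision: int) -> int:
    """Convert a textual price to a fixed-point integer at the given precision."""
    # Scaling via Decimal is exact for decimal inputs, unlike float(),
    # which rounds to the nearest binary double before scaling.
    return int(Decimal(price_str).scaleb(precision))

# Naive float round-trip: 0.29 * 100 evaluates to 28.999999999999996,
# so truncation lands on the wrong integer.
naive = int(float("0.29") * 10**2)          # 28 (float artifact)

# Decimal path is exact.
exact = price_to_raw("0.29", precision=2)   # 29
```

This is why loaders typically either read precision from an encoding spec (binary formats) or let the caller pin it explicitly, rather than inferring it through floats.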
Usage
External data loading is used in the following contexts:
- Historical data preparation -- Converting vendor-supplied data files into typed objects before writing them to a data catalog.
- Backtesting pipelines -- Loading data files directly for immediate consumption by a backtest engine.
- Data validation -- Inspecting loaded data objects to verify correctness of symbology, precision, and timestamp ordering.
- Multi-vendor workflows -- Combining data from multiple providers (e.g., Databento for US equities, Tardis for crypto) into a unified catalog.
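A multi-vendor workflow ultimately reduces to merging per-vendor record streams into one time-ordered stream. The sketch below uses a minimal stand-in record type (field names are illustrative, not the NautilusTrader `TradeTick`) and `heapq.merge`, which assumes each input stream is already ordered, as loaders typically guarantee per file:

```python
from dataclasses import dataclass
from heapq import merge

@dataclass(frozen=True)
class TradeTick:
    # Minimal stand-in for a canonical trade object (illustrative fields).
    instrument_id: str
    price_raw: int
    ts_init: int

def combine_streams(*streams):
    """Merge per-vendor, time-ordered record streams into one ordered stream."""
    return list(merge(*streams, key=lambda r: r.ts_init))

databento = [TradeTick("AAPL.XNAS", 19850, 1), TradeTick("AAPL.XNAS", 19860, 5)]
tardis = [TradeTick("BTCUSDT.BINANCE", 6400000, 3)]
combined = combine_streams(databento, tardis)
```

Once merged, the unified stream can be written to a single catalog regardless of which vendor each record came from.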
Theoretical Basis
Data Loading Pipeline
The general architecture for external data loading follows a staged pipeline:
Stage 1: File Discovery
- Locate data files on disk or remote storage
- Determine file format (DBN, CSV, JSON, etc.)
Stage 2: Schema Detection
- Inspect file headers or metadata to determine the data schema
- Map vendor schema to canonical data type(s)
Stage 3: Deserialization
- Decode binary/text records into structured objects
- Apply precision settings for price and size fields
- Map vendor instrument symbols to internal InstrumentIds
Stage 4: Type Conversion (optional)
- Convert between internal representations (e.g., Rust pyo3 objects to Cython objects)
- Merge related data streams (e.g., MBP-1 produces both QuoteTick and optional TradeTick)
Stage 5: Output
- Return a list of typed Data objects ready for catalog ingestion or direct use
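The five stages above can be sketched as a single driver function. This is a simplified illustration with hypothetical names (`detect_format`, `load_file`), handling only CSV and skipping the optional type-conversion stage:

```python
import csv
import io
from pathlib import Path

def detect_format(path: str) -> str:
    # Stage 1: file discovery / format detection by extension (simplified).
    return Path(path).suffix.lstrip(".").lower()

def load_file(path: str, text: str):
    """Hypothetical end-to-end loader illustrating the staged pipeline."""
    fmt = detect_format(path)                  # Stage 1
    if fmt != "csv":
        raise ValueError(f"unsupported format: {fmt}")
    reader = csv.DictReader(io.StringIO(text))
    schema = tuple(reader.fieldnames)          # Stage 2: header row defines schema
    records = [
        # Stage 3: deserialize rows into structured, typed records.
        {"symbol": row["symbol"], "price": float(row["price"]), "ts": int(row["ts"])}
        for row in reader
    ]
    return schema, records                     # Stage 5: typed output
```

A real loader would return canonical typed objects rather than dicts, and Stage 4 would convert between internal representations where required.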
Schema-to-Type Mapping
A core component of external data loading is the mapping table from vendor schemas to internal types:
| Vendor Schema | Internal Type(s) |
|---|---|
| MBO | OrderBookDelta |
| MBP_1 / TBBO | QuoteTick (+ optional TradeTick) |
| MBP_10 | OrderBookDepth10 |
| BBO_1S / BBO_1M | QuoteTick |
| TRADES | TradeTick |
| OHLCV_1S/1M/1H/1D | Bar |
| DEFINITION | Instrument (subtype varies) |
| STATUS | InstrumentStatus |
| IMBALANCE | Vendor-specific type |
| STATISTICS | Vendor-specific type |
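In code, such a mapping is naturally a lookup table. The dict below mirrors the table above for illustration (names are type names, not imports; `canonical_types` is a hypothetical helper, and vendor-specific schemas are omitted):

```python
# Illustrative mapping from vendor schema names to canonical type name(s).
SCHEMA_TO_TYPES = {
    "MBO": ("OrderBookDelta",),
    "MBP_1": ("QuoteTick", "TradeTick"),  # TradeTick only when a trade is present
    "TBBO": ("QuoteTick", "TradeTick"),
    "MBP_10": ("OrderBookDepth10",),
    "BBO_1S": ("QuoteTick",),
    "BBO_1M": ("QuoteTick",),
    "TRADES": ("TradeTick",),
    "OHLCV_1S": ("Bar",),
    "OHLCV_1M": ("Bar",),
    "OHLCV_1H": ("Bar",),
    "OHLCV_1D": ("Bar",),
    "DEFINITION": ("Instrument",),  # concrete subtype varies by asset class
    "STATUS": ("InstrumentStatus",),
}

def canonical_types(schema: str) -> tuple:
    """Look up the canonical output type(s) for a vendor schema name."""
    try:
        return SCHEMA_TO_TYPES[schema]
    except KeyError:
        raise ValueError(f"unmapped vendor schema: {schema}") from None
```

Failing loudly on an unmapped schema is preferable to silently dropping records during ingestion.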
Precision and Symbology
```python
def load_data(path, instrument_id=None, price_precision=None):
    schema = detect_schema(path)         # Stage 2: schema detection
    records = decode_file(path, schema)  # Stage 3: deserialization
    for record in records:
        if instrument_id is not None:
            record.instrument_id = instrument_id      # Override symbology
        if price_precision is not None:
            record.apply_precision(price_precision)   # Override precision
    return records
```
When an explicit instrument_id is provided, it overrides the vendor symbology for every record in the file, bypassing per-record symbol lookup. This is an optimization for single-instrument files whose identity is known in advance.
CSV vs. Binary Format Trade-offs
| Property | CSV (e.g., Tardis) | Binary (e.g., Databento DBN) |
|---|---|---|
| Parse speed | Slower (text parsing) | Faster (direct memory mapping) |
| File size | Larger | Smaller (compressed binary) |
| Schema flexibility | Column headers define schema | Embedded metadata header |
| Precision | Inferred from string representation | Fixed by encoding spec |
| Streaming | Line-by-line | Chunk-by-chunk |
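The trade-offs above can be made concrete with a toy comparison. The record layout below is illustrative only (it is not the DBN wire format): fixed-width binary records decode with `struct` and no text scanning, while the CSV path must parse each field from text against a header-defined schema:

```python
import csv
import io
import struct

# One fixed-width binary record: (price_raw: int64, size_raw: int64, ts: uint64).
RECORD = struct.Struct("<qqQ")

def parse_binary(buf: bytes):
    # Fixed-width records decode by offset with no text scanning, which is
    # why binary formats parse faster and stream naturally chunk-by-chunk.
    return [RECORD.unpack_from(buf, i) for i in range(0, len(buf), RECORD.size)]

def parse_csv(text: str):
    # CSV requires per-field text parsing; the header row defines the schema.
    return [(int(r["price_raw"]), int(r["size_raw"]), int(r["ts"]))
            for r in csv.DictReader(io.StringIO(text))]

binary = RECORD.pack(19850, 100, 1) + RECORD.pack(19860, 50, 2)
text = "price_raw,size_raw,ts\n19850,100,1\n19860,50,2\n"
assert parse_binary(binary) == parse_csv(text)  # same records, different encodings
```

Note the size difference as well: each binary record here is a fixed 24 bytes, while the equivalent CSV row grows with the digit count of its values.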