

Principle:Nautechsystems Nautilus trader External Data Loading

From Leeroopedia


Field         Value
Sources       GitHub Repository, NautilusTrader Documentation
Domains       Market Data Ingestion, Data Formats, ETL Pipelines
Last Updated  2026-02-10 12:00 GMT

Overview

External data loading is the process of ingesting market data from third-party providers and proprietary file formats into a normalized, typed object model suitable for backtesting, analysis, and live trading.

Description

Financial market data originates from a variety of vendors, exchanges, and data providers, each with their own proprietary file formats, encoding schemes, and schema conventions. Before this data can be used in a trading system, it must be decoded, validated, mapped to a canonical type system, and optionally converted between internal representations. External data loading addresses this translation layer.

The key challenges that external data loading solves include:

  • Format heterogeneity -- Vendors deliver data in binary formats (e.g., Databento's DBN encoding), CSV files (e.g., Tardis), proprietary APIs, or database dumps. Each requires a dedicated parser.
  • Schema mapping -- Raw vendor schemas (MBO, MBP-1, MBP-10, TRADES, OHLCV, DEFINITION, etc.) must be mapped to the system's canonical data types (OrderBookDelta, QuoteTick, TradeTick, Bar, Instrument, etc.).
  • Symbology translation -- Vendor-specific instrument identifiers must be mapped to the system's internal InstrumentId format (symbol + venue).
  • Precision handling -- Price and size precision must be correctly inferred or specified to avoid floating-point artifacts.
  • Performance -- Loading large historical datasets (millions of rows) requires efficient deserialization, ideally leveraging compiled code paths (Rust/Cython) rather than pure Python.
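
To make the format-heterogeneity point concrete, a loader front end typically identifies the on-disk format before any vendor-specific decoding. A minimal, self-contained sketch (the helper name and format strings are illustrative, not a NautilusTrader API):

from pathlib import Path

def detect_format(path):
    # Identify the on-disk format from the file name
    # (Stage 1 of the loading pipeline described under Theoretical Basis)
    name = Path(path).name.lower()
    if name.endswith((".dbn", ".dbn.zst")):
        return "dbn"   # Databento binary encoding
    if name.endswith((".csv", ".csv.gz")):
        return "csv"   # e.g., Tardis flat files
    if name.endswith(".json"):
        return "json"
    raise ValueError(f"Unrecognized data file format: {path}")

# e.g., detect_format("trades.dbn.zst") -> "dbn"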

Usage

External data loading is used in the following contexts:

  • Historical data preparation -- Converting vendor-supplied data files into typed objects before writing them to a data catalog.
  • Backtesting pipelines -- Loading data files directly for immediate consumption by a backtest engine.
  • Data validation -- Inspecting loaded data objects to verify correctness of symbology, precision, and timestamp ordering.
  • Multi-vendor workflows -- Combining data from multiple providers (e.g., Databento for US equities, Tardis for crypto) into a unified catalog.
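
The first two contexts typically chain together: decode a vendor file into typed objects, then persist them to a catalog. A minimal sketch, assuming the Databento adapter's DatabentoDataLoader and the Parquet-backed data catalog; the file path and instrument identifier are placeholders, and exact signatures may vary across versions:

from nautilus_trader.adapters.databento.loaders import DatabentoDataLoader
from nautilus_trader.model.identifiers import InstrumentId
from nautilus_trader.persistence.catalog import ParquetDataCatalog

# Decode a DBN file into typed objects, overriding vendor symbology
loader = DatabentoDataLoader()
trades = loader.from_dbn_file(
    "trades.dbn.zst",
    instrument_id=InstrumentId.from_str("AAPL.XNAS"),
)

# Persist the typed objects for later backtesting
catalog = ParquetDataCatalog("./catalog")
catalog.write_data(trades)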

Theoretical Basis

Data Loading Pipeline

The general architecture for external data loading follows a staged pipeline:

Stage 1: File Discovery
    - Locate data files on disk or remote storage
    - Determine file format (DBN, CSV, JSON, etc.)

Stage 2: Schema Detection
    - Inspect file headers or metadata to determine the data schema
    - Map vendor schema to canonical data type(s)

Stage 3: Deserialization
    - Decode binary/text records into structured objects
    - Apply precision settings for price and size fields
    - Map vendor instrument symbols to internal InstrumentIds

Stage 4: Type Conversion (optional)
    - Convert between internal representations (e.g., Rust pyo3 objects to Cython objects)
    - Merge related data streams (e.g., MBP-1 produces both QuoteTick and optional TradeTick)

Stage 5: Output
    - Return a list of typed Data objects ready for catalog ingestion or direct use
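
The stages compose naturally as pluggable callables. A minimal skeleton, with stage responsibilities as comments (the function and its parameters are illustrative, not a NautilusTrader API); Stage 1 is assumed to have already located the file:

def run_pipeline(path, detect_schema, decode_file, convert=None):
    schema = detect_schema(path)          # Stage 2: map vendor schema to canonical types
    records = decode_file(path, schema)   # Stage 3: decode, apply precision and symbology
    if convert is not None:
        records = [convert(r) for r in records]  # Stage 4: optional representation change
    return list(records)                  # Stage 5: typed objects ready for use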

Schema-to-Type Mapping

A core component of external data loading is the mapping table from vendor schemas to internal types:

Vendor Schema          ->  Internal Type(s)
--------------------------------------------------
MBO                    ->  OrderBookDelta
MBP_1 / TBBO           ->  QuoteTick (+ optional TradeTick)
MBP_10                 ->  OrderBookDepth10
BBO_1S / BBO_1M        ->  QuoteTick
TRADES                 ->  TradeTick
OHLCV_1S/1M/1H/1D      ->  Bar
DEFINITION             ->  Instrument (subtype varies)
STATUS                 ->  InstrumentStatus
IMBALANCE              ->  vendor-specific type
STATISTICS             ->  vendor-specific type
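
In code, this mapping is naturally expressed as a dispatch table keyed by schema name. A sketch using NautilusTrader's canonical type names (the table itself and the schema keys are illustrative; import paths may vary across versions):

from nautilus_trader.model.data import (
    Bar,
    OrderBookDelta,
    OrderBookDepth10,
    QuoteTick,
    TradeTick,
)

# Schema name -> canonical type(s) the decoder emits
SCHEMA_TO_TYPES = {
    "mbo": (OrderBookDelta,),
    "mbp-1": (QuoteTick, TradeTick),  # TradeTick only when the update carries a trade
    "mbp-10": (OrderBookDepth10,),
    "bbo-1s": (QuoteTick,),
    "trades": (TradeTick,),
    "ohlcv-1m": (Bar,),
}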

Precision and Symbology

def load_data(path, instrument_id=None, price_precision=None):
    schema = detect_schema(path)         # Stage 2: schema detection
    records = decode_file(path, schema)  # Stage 3: deserialization

    for record in records:
        if instrument_id is not None:
            record.instrument_id = instrument_id      # Override vendor symbology
        if price_precision is not None:
            record.apply_precision(price_precision)   # Override inferred precision

    return records

When an explicit instrument_id is provided, it overrides the vendor symbology for all records in the file, skipping per-record symbol resolution. This is an optimization for single-instrument files whose identity is known in advance.
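
As a concrete case of precision handling: Databento's DBN encoding stores prices as fixed-point integers with 1e-9 scale, which a loader converts into typed price objects at the target precision. A minimal sketch, assuming NautilusTrader's Price value type (the helper function is illustrative):

from nautilus_trader.model.objects import Price

DBN_PRICE_SCALE = 1e9  # DBN price fields are fixed-point integers in 1e-9 units

def decode_price(raw, precision):
    # Convert a fixed-point integer price into a typed Price at the given precision
    return Price(raw / DBN_PRICE_SCALE, precision)

# e.g., raw 4_500_250_000_000 at precision 2 -> Price of 4500.25
price = decode_price(4_500_250_000_000, 2)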

CSV vs. Binary Format Trade-offs

Property             CSV (e.g., Tardis)                   Binary (e.g., Databento DBN)
---------------------------------------------------------------------------------------
Parse speed          Slower (text parsing)                Faster (direct memory mapping)
File size            Larger                               Smaller (compressed binary)
Schema flexibility   Column headers define schema         Embedded metadata header
Precision            Inferred from string representation  Fixed by encoding spec
Streaming            Line-by-line                         Chunk-by-chunk
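
The streaming row corresponds to two different iteration patterns, shown here in plain Python (file paths are placeholders):

def stream_csv_lines(path):
    # CSV: iterate one text record at a time; the header row defines the schema
    with open(path, "r", encoding="utf-8") as f:
        header = next(f).rstrip("\n").split(",")
        for line in f:
            yield dict(zip(header, line.rstrip("\n").split(",")))

def stream_binary_chunks(path, chunk_size=1 << 20):
    # Binary: read fixed-size byte chunks for a decoder to parse incrementally
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk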
