Principle:Nautechsystems Nautilus trader Backtest Data Injection

From Leeroopedia


Field Value
sources https://github.com/nautechsystems/nautilus_trader , https://nautilustrader.io/docs/
domains backtesting, data management, event streaming
last_updated 2026-02-10 12:00 GMT

Overview

Backtest Data Injection is the principle of feeding pre-processed historical market data objects into a backtesting engine's internal data stream so that the simulation loop can replay them in chronological order.

Description

Once raw market data has been wrangled into strongly-typed objects (trade ticks, quote ticks, bars, order book deltas), those objects must be injected into the engine's data pipeline. This injection step is distinct from data wrangling: wrangling converts format, while injection manages ordering, validation, and routing.

The principle addresses the following concerns:

  • Chronological ordering -- The backtest engine processes data in non-decreasing timestamp order (equal timestamps are allowed). Data from multiple instruments and multiple types must be merged into a single sorted stream. The injection step can either re-sort on each addition or defer sorting to a single final pass, which performs better on large datasets.
  • Instrument validation -- Each data element is checked against the engine's instrument cache. If data references an instrument that has not been registered, the engine raises an error early rather than failing mid-simulation.
  • Client routing -- Data that belongs to a known instrument is automatically routed to the venue's market data client. Custom data or data without an instrument ID must be paired with an explicit client_id.
  • Book data tracking -- The engine tracks which instruments have order-book data. This is critical for venues configured with L2/L3 book types: if an instrument has general data but no book data, the engine raises an error at run time.
  • Subscription registration -- Injected data types are recorded as backtest subscriptions so that the data engine knows which data streams are available without needing live subscription requests.
  • Performance optimization -- For large datasets, the engine supports a deferred-sort pattern: add multiple data streams with sort=False, then call sort_data() once. This avoids re-sorting the entire accumulated dataset, an O(N log N) operation, on every addition.
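The instrument validation and client routing concerns above can be sketched in a few lines. This is a minimal illustration, not the engine's code: Tick, Signal, and resolve_client are hypothetical names, and deriving the venue from the suffix of a "SYMBOL.VENUE" string is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tick:
    """Hypothetical market data element tied to an instrument."""
    instrument_id: str  # assumed "SYMBOL.VENUE" form, e.g. "BTCUSDT.BINANCE"
    ts_init: int


@dataclass(frozen=True)
class Signal:
    """Hypothetical custom data element with no instrument."""
    value: float
    ts_init: int


def resolve_client(first, known_instruments: set, client_id: Optional[str] = None) -> str:
    """Pick the client for a batch by inspecting only its first element."""
    instrument_id = getattr(first, "instrument_id", None)
    if instrument_id is not None:
        # Fail early if the instrument was never registered.
        if instrument_id not in known_instruments:
            raise ValueError(f"{instrument_id} not found in instrument cache")
        # Route to the venue's market data client (venue = text after the last '.').
        return instrument_id.rsplit(".", 1)[1]
    # Custom data has no instrument, so the caller must name a client.
    if client_id is None:
        raise ValueError("custom data requires an explicit client_id")
    return client_id


instruments = {"BTCUSDT.BINANCE"}
print(resolve_client(Tick("BTCUSDT.BINANCE", 1), instruments))        # BINANCE
print(resolve_client(Signal(0.5, 2), instruments, client_id="NEWS"))  # NEWS
```

Raising at injection time means a typo in an instrument ID surfaces before the simulation starts rather than partway through a long run.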

Usage

Apply this principle whenever you need to:

  • Load one or more historical data streams into a backtest engine.
  • Combine data from multiple instruments or data types into a unified timeline.
  • Optimize data loading performance for large datasets.
  • Inject custom (non-market) data alongside standard market data.

Theoretical Basis

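Data injection for event-driven simulation is conceptually a merge of pre-sorted streams: each data source typically arrives ordered by timestamp, and the engine combines them into a single globally sorted stream. In the pseudocode below this is achieved by appending each batch and sorting the combined list by ts_init, rather than by an explicit k-way merge.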
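To make the merge idea concrete, the sketch below combines three pre-sorted streams with the standard library's heapq.merge. The Event tuple and the sample timestamps are invented for illustration; the engine itself is not claimed to use heapq.

```python
import heapq
from collections import namedtuple

# Hypothetical event record: only ts_init matters for ordering.
Event = namedtuple("Event", ["ts_init", "source"])

# Each stream is already sorted by ts_init.
trades = [Event(1, "trade"), Event(4, "trade"), Event(9, "trade")]
quotes = [Event(2, "quote"), Event(3, "quote"), Event(10, "quote")]
bars = [Event(5, "bar"), Event(10, "bar")]

# k-way merge: O(N log k) for N events across k streams.
merged = list(heapq.merge(trades, quotes, bars, key=lambda e: e.ts_init))
timestamps = [e.ts_init for e in merged]
print(timestamps)  # [1, 2, 3, 4, 5, 9, 10, 10]
```

Appending the streams to one list and sorting it by ts_init gives the same timestamp order. Python's built-in sort detects already-ordered runs, so sorting concatenated pre-sorted streams is usually cheaper than sorting shuffled data.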

Key theoretical elements:

  • Timestamp-ordered event stream -- The backtest loop iterates over data in ts_init order. The injection step must guarantee this invariant, either eagerly (sort on each add) or lazily (sort once before run).
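  • Copy semantics -- The engine copies the element references from the input list into its own internal list (self._data.extend(data)). Later changes to the caller's list (appending, removing, or reordering elements) therefore do not affect engine state. This is a shallow copy: the data objects themselves are shared, not duplicated.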
  • Validation as contract enforcement -- When validate=True, the engine checks that the first element's instrument is in the cache, that bars have EXTERNAL aggregation source, and that custom data has an associated client ID. Because only the first element is inspected, the checks run in O(1), but they rely on every element in the list being of the same type.
  • Iterator synchronization -- After sorting, the data is synced to a BacktestDataIterator which provides O(1) access to the next event during the simulation loop. The iterator is append-only and pre-sorted.
  • Deferred sort pattern -- Adding data with sort=False sets a _sorted flag to False. The run() method checks this flag and raises a RuntimeError if data is unsorted, enforcing the contract that sort_data() must be called.
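The copy semantics item is easy to misread, so here is a small demonstration using plain Python lists. The Tick class and the engine_data list are stand-ins invented for this example; only the list.extend behaviour is the point.

```python
class Tick:
    """Hypothetical mutable data element."""

    def __init__(self, ts_init, price):
        self.ts_init = ts_init
        self.price = price


engine_data = []  # stands in for the engine's internal list
batch = [Tick(1, 100.0), Tick(2, 101.0)]
first = batch[0]  # keep a handle on one element

# extend() copies the references into the engine's own list.
engine_data.extend(batch)

# Mutating the caller's list afterwards does not touch the engine's list...
batch.clear()
print(len(engine_data))  # 2

# ...but the element objects are shared, so in-place changes remain visible.
first.price = 999.0
print(engine_data[0].price)  # 999.0
```

If the data objects are immutable, as strongly typed market data objects often are, the shared references are harmless and the shallow copy is enough to isolate the engine from the caller.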

Pseudocode:

FUNCTION inject_data(engine, data_list, client_id, validate, sort):
    VALIDATE data_list is not empty
    VALIDATE all elements are of type Data

    IF validate:
        first = data_list[0]
        IF first has instrument_id:
            ASSERT instrument_id is in cache
            TRACK instrument_id in has_data set
        ELIF first is Bar:
            ASSERT bar_type.instrument_id is in cache
            ASSERT aggregation_source == EXTERNAL
        ELSE:
            ASSERT client_id is not None

        IF first is book data type:
            TRACK instrument_id in has_book_data set

    EXTEND engine.data with data_list

    IF sort:
        SORT engine.data by ts_init
        SYNC sorted data to iterator
        SET sorted_flag = True
    ELSE:
        SET sorted_flag = False

    REGISTER subscription names for data types

    LOG "Added {count} elements"
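The pseudocode can be rendered as a small runnable Python sketch. Everything here is a toy under stated assumptions: ToyEngine, Tick, and BookDelta are hypothetical classes, the bar branch and logging are omitted for brevity, and the real engine's internals may differ. It is meant to show the validate, extend, sort-or-defer flow, including the RuntimeError on unsorted data described above.

```python
from dataclasses import dataclass, field


@dataclass
class Tick:
    instrument_id: str
    ts_init: int


@dataclass
class BookDelta(Tick):
    """Hypothetical order book update; counts as book data."""


@dataclass
class ToyEngine:
    instruments: set                      # stands in for the instrument cache
    data: list = field(default_factory=list)
    has_book_data: set = field(default_factory=set)
    subscriptions: set = field(default_factory=set)
    is_sorted: bool = True

    def add_data(self, batch, client_id=None, validate=True, sort=True):
        if not batch:
            raise ValueError("data must not be empty")
        if validate:
            first = batch[0]  # only the first element is checked
            instrument_id = getattr(first, "instrument_id", None)
            if instrument_id is not None:
                if instrument_id not in self.instruments:
                    raise ValueError(f"{instrument_id} not in instrument cache")
            elif client_id is None:
                raise ValueError("custom data requires a client_id")
            if isinstance(first, BookDelta):
                self.has_book_data.add(first.instrument_id)
        self.data.extend(batch)           # shallow copy of references
        if sort:
            self.sort_data()
        else:
            self.is_sorted = False        # caller must sort before run()
        self.subscriptions.add(type(batch[0]).__name__)

    def sort_data(self):
        self.data.sort(key=lambda d: d.ts_init)
        self.is_sorted = True

    def run(self):
        if not self.is_sorted:
            raise RuntimeError("data is unsorted; call sort_data() first")
        return [d.ts_init for d in self.data]


engine = ToyEngine(instruments={"BTCUSDT.BINANCE", "ETHUSDT.BINANCE"})
engine.add_data([Tick("BTCUSDT.BINANCE", 3), Tick("BTCUSDT.BINANCE", 7)], sort=False)
engine.add_data([BookDelta("ETHUSDT.BINANCE", 1), BookDelta("ETHUSDT.BINANCE", 5)], sort=False)
engine.sort_data()   # one sort for all batches
print(engine.run())  # [1, 3, 5, 7]
```

With sort=False on every call, the list is sorted once regardless of how many batches are added, which is the saving the deferred-sort pattern is after.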
