Principle:Nautechsystems Nautilus trader Backtest Data Injection
| Field | Value |
|---|---|
| sources | https://github.com/nautechsystems/nautilus_trader , https://nautilustrader.io/docs/ |
| domains | backtesting, data management, event streaming |
| last_updated | 2026-02-10 12:00 GMT |
Overview
Backtest Data Injection is the principle of feeding pre-processed historical market data objects into a backtesting engine's internal data stream so that the simulation loop can replay them in chronological order.
Description
Once raw market data has been wrangled into strongly-typed objects (trade ticks, quote ticks, bars, order book deltas), those objects must be injected into the engine's data pipeline. This injection step is distinct from data wrangling: wrangling converts format, while injection manages ordering, validation, and routing.
The principle addresses the following concerns:
- Chronological ordering -- The backtest engine processes data in monotonically increasing timestamp order. Data from multiple instruments and multiple types must be merged into a single sorted stream. The injection step can sort on each addition or defer sorting to a single final pass for better performance.
- Instrument validation -- Each data element is checked against the engine's instrument cache. If data references an instrument that has not been registered, the engine raises an error early rather than failing mid-simulation.
- Client routing -- Data that belongs to a known instrument is automatically routed to the venue's market data client. Custom data or data without an instrument ID must be paired with an explicit `client_id`.
- Book data tracking -- The engine tracks which instruments have order-book data. This is critical for venues configured with L2/L3 book types: if an instrument has general data but no book data, the engine raises an error at run time.
- Subscription registration -- Injected data types are recorded as backtest subscriptions so that the data engine knows which data streams are available without needing live subscription requests.
- Performance optimization -- For large datasets, the engine supports a deferred-sort pattern: add multiple data streams with `sort=False`, then call `sort_data()` once. This avoids an O(N log N) sort on every addition.
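The difference between eager and deferred sorting can be illustrated with plain Python lists. This is a minimal sketch, not engine code: `load_eager` and `load_deferred` are hypothetical helpers, and events are modelled as `(ts_init, instrument)` tuples purely for illustration.

```python
# Each stream is already sorted by ts_init, as wrangled data usually is.
# The tuples are hypothetical stand-ins for real market data objects.
streams = [
    [(1, "EUR/USD"), (4, "EUR/USD"), (7, "EUR/USD")],
    [(2, "GBP/USD"), (3, "GBP/USD"), (9, "GBP/USD")],
    [(5, "USD/JPY"), (6, "USD/JPY"), (8, "USD/JPY")],
]


def ts_init(event):
    """Sort key: the event's initialization timestamp."""
    return event[0]


def load_eager(streams):
    """Sort the whole buffer after every addition (like sort=True each time)."""
    data, sorts = [], 0
    for stream in streams:
        data.extend(stream)
        data.sort(key=ts_init)
        sorts += 1
    return data, sorts


def load_deferred(streams):
    """Append everything first, then sort once (like sort=False + one final sort)."""
    data = []
    for stream in streams:
        data.extend(stream)
    data.sort(key=ts_init)
    return data, 1


eager_data, eager_sorts = load_eager(streams)
deferred_data, deferred_sorts = load_deferred(streams)
```

Both strategies end with the same ordered stream. The deferred variant performs one sort regardless of how many streams are added, which is where the saving comes from on large datasets.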
Usage
Apply this principle whenever you need to:
- Load one or more historical data streams into a backtest engine.
- Combine data from multiple instruments or data types into a unified timeline.
- Optimize data loading performance for large datasets.
- Inject custom (non-market) data alongside standard market data.
Theoretical Basis
Data injection for event-driven simulation is, in effect, a merge of pre-sorted streams: each data source is already ordered by timestamp, and the engine combines them into a single globally sorted stream.
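The merge view can be sketched with the standard library's `heapq.merge`, which lazily combines inputs that are already sorted. This only illustrates the concept and is not a claim about how the engine is implemented (the pseudocode on this page sorts the combined list instead). The `Tick` type is a hypothetical stand-in.

```python
import heapq
from typing import NamedTuple


class Tick(NamedTuple):
    """Hypothetical minimal event: only the fields needed for ordering."""
    ts_init: int
    instrument: str


# Two per-instrument streams, each already sorted by ts_init.
quotes = [Tick(10, "AAPL"), Tick(30, "AAPL"), Tick(50, "AAPL")]
trades = [Tick(20, "MSFT"), Tick(30, "MSFT"), Tick(40, "MSFT")]

# heapq.merge requires every input to be sorted by the same key
# and yields the combined items in key order.
merged = list(heapq.merge(quotes, trades, key=lambda t: t.ts_init))
merged_ts = [t.ts_init for t in merged]
```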
Key theoretical elements:
- Timestamp-ordered event stream -- The backtest loop iterates over data in `ts_init` order. The injection step must guarantee this invariant, either eagerly (sort on each add) or lazily (sort once before run).
- Copy semantics -- The engine copies the contents of the input list into its own list (`self._data.extend(data)`), so later changes to the caller's list do not affect engine state after injection. The copy is shallow: the data objects themselves are shared, not duplicated.
- Validation as contract enforcement -- When `validate=True`, the engine checks that the first element's instrument is in the cache, that bars have `EXTERNAL` aggregation source, and that custom data has an associated client ID. These checks are O(1) because they inspect only the first element, on the assumption that all elements in the list are of the same type.
- Iterator synchronization -- After sorting, the data is synced to a `BacktestDataIterator`, which provides O(1) access to the next event during the simulation loop. The iterator is append-only and pre-sorted.
- Deferred sort pattern -- Adding data with `sort=False` sets a `_sorted` flag to `False`. The `run()` method checks this flag and raises a `RuntimeError` if data is unsorted, enforcing the contract that `sort_data()` must be called.
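The copy semantics and the sorted-flag contract can be demonstrated with a toy class. This is a sketch under stated assumptions: `ToyEngine` is hypothetical, events are plain dictionaries, and the method names only mirror those mentioned in the text.

```python
class ToyEngine:
    """Illustrative only; not the real backtest engine."""

    def __init__(self) -> None:
        self._data: list = []
        self._sorted = True  # an empty stream is trivially sorted

    def add_data(self, data: list, sort: bool = True) -> None:
        # extend() copies the references into the engine's own list,
        # so the caller's list object is not retained.
        self._data.extend(data)
        if sort:
            self.sort_data()
        else:
            self._sorted = False

    def sort_data(self) -> None:
        self._data.sort(key=lambda item: item["ts_init"])
        self._sorted = True

    def run(self) -> list:
        # Refuse to replay events that may be out of order.
        if not self._sorted:
            raise RuntimeError("data is unsorted; call sort_data() first")
        return [item["ts_init"] for item in self._data]


engine = ToyEngine()
caller_list = [{"ts_init": 3}, {"ts_init": 1}]
engine.add_data(caller_list, sort=False)

caller_list.clear()  # emptying the caller's list leaves the engine's data intact

try:
    engine.run()
    guard_triggered = False
except RuntimeError:
    guard_triggered = True

engine.sort_data()
replayed = engine.run()
```

Because the copy is shallow, mutating one of the dictionaries after injection would still be visible inside the toy engine. Only changes to the caller's list itself are isolated.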
Pseudocode:
    FUNCTION inject_data(engine, data_list, client_id, validate, sort):
        VALIDATE data_list is not empty
        VALIDATE all elements are of type Data
        IF validate:
            first = data_list[0]
            IF first has instrument_id:
                ASSERT instrument_id is in cache
                TRACK instrument_id in has_data set
            ELIF first is Bar:
                ASSERT bar_type.instrument_id is in cache
                ASSERT aggregation_source == EXTERNAL
            ELSE:
                ASSERT client_id is not None
            IF first is book data type:
                TRACK instrument_id in has_book_data set
        EXTEND engine.data with data_list
        IF sort:
            SORT engine.data by ts_init
            SYNC sorted data to iterator
            SET sorted_flag = True
        ELSE:
            SET sorted_flag = False
        REGISTER subscription names for data types
        LOG "Added {count} elements"
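The validation branch of the pseudocode can be rendered as a small runnable sketch. All types here (`ToyQuote`, `ToyBarType`, `ToyBar`, `ToyCustom`) and the helpers `validate_first` and `is_accepted` are hypothetical stand-ins; the instrument cache is assumed to be a plain set of instrument ID strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyQuote:
    instrument_id: str
    ts_init: int


@dataclass(frozen=True)
class ToyBarType:
    instrument_id: str
    aggregation_source: str  # "EXTERNAL" or "INTERNAL"


@dataclass(frozen=True)
class ToyBar:
    bar_type: ToyBarType
    ts_init: int


@dataclass(frozen=True)
class ToyCustom:
    ts_init: int


def validate_first(data: list, known: set, client_id: Optional[str] = None) -> None:
    """Check only data[0], assuming the list holds a single data type."""
    if not data:
        raise ValueError("data must not be empty")
    first = data[0]
    if hasattr(first, "instrument_id"):
        if first.instrument_id not in known:
            raise ValueError(f"unknown instrument: {first.instrument_id}")
    elif isinstance(first, ToyBar):
        if first.bar_type.instrument_id not in known:
            raise ValueError(f"unknown instrument: {first.bar_type.instrument_id}")
        if first.bar_type.aggregation_source != "EXTERNAL":
            raise ValueError("bars must have EXTERNAL aggregation source")
    elif client_id is None:
        raise ValueError("custom data requires a client_id")


def is_accepted(data: list, known: set, client_id: Optional[str] = None) -> bool:
    """Return True when validation passes, False when it raises."""
    try:
        validate_first(data, known, client_id)
        return True
    except ValueError:
        return False


known = {"EUR/USD"}
results = {
    "known quote": is_accepted([ToyQuote("EUR/USD", 1)], known),
    "unknown quote": is_accepted([ToyQuote("GBP/USD", 1)], known),
    "internal bar": is_accepted([ToyBar(ToyBarType("EUR/USD", "INTERNAL"), 1)], known),
    "custom without client": is_accepted([ToyCustom(1)], known),
    "custom with client": is_accepted([ToyCustom(1)], known, client_id="NEWS"),
}
```

Checking only the first element keeps validation constant-time, at the cost of trusting the caller to pass a homogeneous list.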