Principle:Nautechsystems Nautilus trader Historical Data Wrangling
| Field | Value |
|---|---|
| sources | https://github.com/nautechsystems/nautilus_trader , https://nautilustrader.io/docs/ |
| domains | data engineering, backtesting, market data transformation |
| last_updated | 2026-02-10 12:00 GMT |
Overview
Historical Data Wrangling is the principle of transforming raw, tabular market data (typically pandas DataFrames) into strongly-typed, nanosecond-precision trading objects that an event-driven backtesting engine can consume.
Description
Historical market data arrives in many formats -- CSV files, Parquet files, database exports, vendor-specific APIs -- but almost always lands in a tabular, row-oriented representation such as a pandas DataFrame. An event-driven backtest engine, however, requires individual, strongly-typed data objects (e.g., TradeTick, QuoteTick, Bar) with precise timestamps and instrument-specific scaling. The wrangling step bridges this gap.
The principle addresses several concerns:
- Type conversion -- Raw columns (`price`, `quantity`, `trade_id`, `side`) must be converted into domain objects with the correct precision (determined by the instrument specification).
- Timestamp normalization -- All timestamps must be UTC, represented as nanoseconds since the Unix epoch. A configurable `ts_init_delta` allows simulation of network latency (the difference between the event timestamp and when the system "receives" it).
- Raw data handling -- Some data sources provide pre-scaled fixed-point integers (Nautilus internal format). When the input is flagged as raw (the `is_raw` argument in the pseudocode), the wrangler must reverse this scaling before constructing objects.
- Bar-to-tick synthesis -- When only OHLCV bar data is available, the wrangler can synthesize trade ticks from bar prices, generating open/high/low/close ticks with sub-bar timestamp offsets. This enables bar-level backtesting with tick-level execution semantics.
- Aggressor side inference -- If the source data includes a `side` or `buyer_maker` column, the wrangler maps it to the `AggressorSide` enum; otherwise it defaults to `NO_AGGRESSOR`.
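The timestamp and aggressor-side concerns above can be illustrated with a minimal, standard-library sketch. It does not use the NautilusTrader API: the `AggressorSide` class here is a stand-in, and the helper names `to_unix_nanos`, `stamp`, and `infer_aggressor_side` are hypothetical. The sketch also assumes that naive datetimes are already in UTC, that a `side` column holds "BUY"/"SELL" strings, and that `buyer_maker=True` means the seller was the aggressor.

```python
from datetime import datetime, timezone
from enum import Enum

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class AggressorSide(Enum):
    """Illustrative stand-in for the library enum of the same name."""
    NO_AGGRESSOR = 0
    BUYER = 1
    SELLER = 2


def to_unix_nanos(dt: datetime) -> int:
    """Return nanoseconds since the Unix epoch for a datetime, in UTC."""
    if dt.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    delta = dt.astimezone(timezone.utc) - EPOCH
    # Integer arithmetic avoids float rounding at nanosecond scale.
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


def stamp(dt: datetime, ts_init_delta: int = 0) -> tuple[int, int]:
    """Return (ts_event, ts_init), where ts_init = ts_event + delta."""
    ts_event = to_unix_nanos(dt)
    return ts_event, ts_event + ts_init_delta


def infer_aggressor_side(row: dict) -> AggressorSide:
    """Map an optional 'side' or 'buyer_maker' field to an aggressor side."""
    if "side" in row:
        is_buy = str(row["side"]).upper() == "BUY"
        return AggressorSide.BUYER if is_buy else AggressorSide.SELLER
    if "buyer_maker" in row:
        # If the buyer was the maker, the seller crossed the spread.
        return AggressorSide.SELLER if row["buyer_maker"] else AggressorSide.BUYER
    return AggressorSide.NO_AGGRESSOR
```

For example, `stamp(some_datetime, ts_init_delta=5_000_000)` would model a 5 ms gap between the exchange event and its arrival at the system.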
Usage
Apply this principle whenever you need to:
- Load historical trade data from CSV/Parquet into NautilusTrader's backtest engine.
- Convert bar (OHLCV) data into synthetic trade ticks for engines that require tick-level input.
- Simulate latency between a data source and the trading system.
- Handle vendor-specific data formats with non-standard column names or scaling.
Theoretical Basis
Data wrangling for backtesting can be understood as a type-safe ETL pipeline that transforms tabular data into a stream of domain events.
Key theoretical elements:
- Instrument-driven precision -- The wrangler is parameterized by an `Instrument` object. Every price is constructed with `Price(value, instrument.price_precision)` and every quantity with `Quantity(value, instrument.size_precision)`. This guarantees that the backtest sees exactly the same decimal representation as live trading.
- Dual timestamps -- Each data object carries two timestamps: `ts_event` (when the event occurred at the exchange) and `ts_init` (when the system initialized/received the event). The `ts_init_delta` parameter controls the gap: `ts_init = ts_event + delta`. Setting delta to zero assumes instantaneous data delivery; positive values simulate realistic latency.
- Bar decomposition -- An OHLCV bar is decomposed into four synthetic ticks (open, high, low, close), each offset from the bar timestamp by a configurable number of milliseconds. An optional random seed shuffles the high/low ordering to avoid look-ahead bias in the intra-bar price path.
- Vectorized construction -- For performance, the wrangler uses `map()` over column arrays rather than row-by-row iteration, and the Cython-level `_build_tick` method constructs each `TradeTick` at C speed.
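The bar decomposition described above might look roughly like the following sketch. The function `bar_to_ticks` and its parameters are hypothetical, not the library's interface. It assumes the bar timestamp marks the bar close, so the four ticks are spaced backwards from it, and it models the optional seed as a `random.Random` instance that decides whether high or low comes first.

```python
import random
from typing import Optional

NANOS_PER_MS = 1_000_000


def bar_to_ticks(
    ts_close_ns: int,
    open_: float,
    high: float,
    low: float,
    close: float,
    offset_interval_ms: int = 100,
    rng: Optional[random.Random] = None,
) -> list[tuple[int, float]]:
    """Decompose one OHLC bar into four (timestamp_ns, price) ticks."""
    middle = [high, low]
    # With an RNG supplied, randomly choose whether high or low prints first,
    # so the synthetic intra-bar path is not always high-then-low.
    if rng is not None and rng.random() < 0.5:
        middle.reverse()
    prices = [open_, *middle, close]
    step = offset_interval_ms * NANOS_PER_MS
    # Assumption: the bar timestamp is the close, so the last tick lands on it
    # and earlier ticks are placed one step apart before it.
    return [(ts_close_ns - (3 - i) * step, p) for i, p in enumerate(prices)]


ticks = bar_to_ticks(1_000_000_000, 10.0, 12.0, 9.0, 11.0)
```

Passing `rng=random.Random(42)` keeps the shuffling reproducible across runs, which matters when comparing backtest results.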
Pseudocode:
    FUNCTION wrangle_trade_ticks(instrument, dataframe, ts_init_delta, is_raw):
        VALIDATE dataframe is not empty
        CONVERT index to UTC
        COMPUTE ts_event, ts_init from index and delta
        IF is_raw:
            DESCALE price and quantity by FIXED_SCALAR
        INFER aggressor_side from 'side' or 'buyer_maker' column (or default NO_AGGRESSOR)
        FOR EACH row in (price, quantity, side, trade_id, ts_event, ts_init):
            tick = TradeTick(
                instrument.id,
                Price(price, instrument.price_precision),
                Quantity(quantity, instrument.size_precision),
                aggressor_side,
                TradeId(trade_id),
                ts_event,
                ts_init,
            )
            APPEND tick to result
        RETURN result
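As a rough, runnable counterpart to the pseudocode, the sketch below uses plain dataclasses and `decimal.Decimal` in place of the library's `Instrument`, `TradeTick`, `Price`, and `Quantity` types. Several details are assumptions made for illustration: rows are dictionaries that already carry `ts_event` as UTC nanoseconds, `FIXED_SCALAR` is set to `10**9` (the real constant may differ), and aggressor sides are plain strings.

```python
from dataclasses import dataclass
from decimal import Decimal

FIXED_SCALAR = 10**9  # assumption: illustrative fixed-point scale only


@dataclass(frozen=True)
class Instrument:  # stand-in for the library type
    id: str
    price_precision: int
    size_precision: int


@dataclass(frozen=True)
class TradeTick:  # stand-in for the library type
    instrument_id: str
    price: Decimal
    size: Decimal
    aggressor_side: str
    trade_id: str
    ts_event: int
    ts_init: int


def _quantize(value: Decimal, precision: int) -> Decimal:
    """Round a value to the given number of decimal places."""
    return value.quantize(Decimal(1).scaleb(-precision))


def _aggressor(row: dict) -> str:
    if "side" in row:
        return "BUYER" if str(row["side"]).upper() == "BUY" else "SELLER"
    if "buyer_maker" in row:
        return "SELLER" if row["buyer_maker"] else "BUYER"
    return "NO_AGGRESSOR"


def wrangle_trade_ticks(instrument, rows, ts_init_delta=0, is_raw=False):
    """Turn row dictionaries into typed TradeTick objects."""
    if not rows:
        raise ValueError("rows must not be empty")
    ticks = []
    for row in rows:
        price = Decimal(str(row["price"]))
        size = Decimal(str(row["quantity"]))
        if is_raw:
            # Reverse the fixed-point scaling before applying precision.
            price, size = price / FIXED_SCALAR, size / FIXED_SCALAR
        ticks.append(
            TradeTick(
                instrument_id=instrument.id,
                price=_quantize(price, instrument.price_precision),
                size=_quantize(size, instrument.size_precision),
                aggressor_side=_aggressor(row),
                trade_id=str(row["trade_id"]),
                ts_event=row["ts_event"],
                ts_init=row["ts_event"] + ts_init_delta,
            )
        )
    return ticks
```

Unlike the vectorized approach described earlier, this version iterates row by row for readability; it is meant to show the data flow, not the performance characteristics.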