Principle:Nautechsystems Nautilus trader Historical Data Wrangling

From Leeroopedia


Field         Value
sources       https://github.com/nautechsystems/nautilus_trader, https://nautilustrader.io/docs/
domains       data engineering, backtesting, market data transformation
last_updated  2026-02-10 12:00 GMT

Overview

Historical Data Wrangling is the principle of transforming raw, tabular market data (typically pandas DataFrames) into strongly-typed, nanosecond-precision trading objects that an event-driven backtesting engine can consume.

Description

Historical market data arrives in many formats -- CSV files, Parquet files, database exports, vendor-specific APIs -- but almost always lands in a tabular, row-oriented representation such as a pandas DataFrame. An event-driven backtest engine, however, requires individual, strongly-typed data objects (e.g., TradeTick, QuoteTick, Bar) with precise timestamps and instrument-specific scaling. The wrangling step bridges this gap.

The principle addresses several concerns:

  • Type conversion -- Raw columns (price, quantity, trade_id, side) must be converted into domain objects with the correct precision (determined by the instrument specification).
  • Timestamp normalization -- All timestamps must be UTC, represented as nanoseconds since the Unix epoch. A configurable ts_init_delta allows simulation of network latency (the difference between the event timestamp and when the system "receives" it).
  • Raw data handling -- Some data sources provide pre-scaled fixed-point integers (Nautilus internal format). When the caller flags the input as raw (is_raw), the wrangler reverses this scaling before constructing objects.
  • Bar-to-tick synthesis -- When only OHLCV bar data is available, the wrangler can synthesize trade ticks from bar prices, generating open/high/low/close ticks with sub-bar timestamp offsets. This enables bar-level backtesting with tick-level execution semantics.
  • Aggressor side inference -- If the source data includes a side or buyer_maker column, the wrangler maps it to the AggressorSide enum; otherwise it defaults to NO_AGGRESSOR.
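
The timestamp normalization concern above can be sketched with the standard library alone. This is a minimal illustration, not the NautilusTrader implementation: the helper names to_unix_nanos and event_and_init_nanos are hypothetical, naive datetimes are assumed to already be in UTC, and ts_init_delta is assumed to be expressed in nanoseconds.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_nanos(ts: datetime) -> int:
    """Return nanoseconds since the Unix epoch for `ts`.

    Assumption: a naive datetime is treated as already being in UTC.
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    # Aware-datetime subtraction accounts for the UTC offset; integer
    # division by a timedelta avoids float rounding.
    micros = (ts - EPOCH) // timedelta(microseconds=1)
    return micros * 1_000


def event_and_init_nanos(ts: datetime, ts_init_delta: int = 0) -> tuple[int, int]:
    """Return (ts_event, ts_init), where ts_init lags ts_event by the delta."""
    ts_event = to_unix_nanos(ts)
    return ts_event, ts_event + ts_init_delta
```

Python's datetime only resolves microseconds, so a source with true nanosecond timestamps would need an integer or pandas-based representation instead.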

Usage

Apply this principle whenever you need to:

  • Load historical trade data from CSV/Parquet into NautilusTrader's backtest engine.
  • Convert bar (OHLCV) data into synthetic trade ticks for engines that require tick-level input.
  • Simulate latency between a data source and the trading system.
  • Handle vendor-specific data formats with non-standard column names or scaling.

Theoretical Basis

Data wrangling for backtesting can be understood as a type-safe ETL pipeline that transforms tabular data into a stream of domain events.

Key theoretical elements:

  • Instrument-driven precision -- The wrangler is parameterized by an Instrument object. Every price is constructed with Price(value, instrument.price_precision) and every quantity with Quantity(value, instrument.size_precision). This guarantees that the backtest sees exactly the same decimal representation as live trading.
  • Dual timestamps -- Each data object carries two timestamps: ts_event (when the event occurred at the exchange) and ts_init (when the system initialized/received the event). The ts_init_delta parameter controls the gap: ts_init = ts_event + delta. Setting delta to zero assumes instantaneous data delivery; positive values simulate realistic latency.
  • Bar decomposition -- An OHLCV bar is decomposed into four synthetic ticks (open, high, low, close), each offset from the bar timestamp by a configurable number of milliseconds. An optional random seed randomizes whether the high or the low comes first, so the synthetic intra-bar price path is not biased toward one fixed ordering.
  • Vectorized construction -- For performance, the wrangler uses map() over column arrays rather than row-by-row iteration, and the Cython-level _build_tick method constructs each TradeTick at C speed.
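
Bar decomposition, as described above, might look roughly like the following sketch. It rests on several assumptions that the text does not confirm: the bar timestamp marks the bar close, the four ticks are spaced evenly by offset_ms counting back from that timestamp, and volume is ignored. SyntheticTick and decompose_bar are hypothetical names, not NautilusTrader classes.

```python
import random
from dataclasses import dataclass

NANOS_PER_MILLI = 1_000_000


@dataclass(frozen=True)
class SyntheticTick:
    """Hypothetical stand-in for a trade tick (not the NautilusTrader class)."""
    price: float
    ts_event: int  # nanoseconds since the Unix epoch


def decompose_bar(open_, high, low, close, ts_bar_ns, offset_ms=100, rng=None):
    """Split one OHLC bar into four ticks ending at the bar timestamp.

    Assumptions: `ts_bar_ns` marks the bar close, ticks are `offset_ms`
    apart, and when `rng` is given the high/low order is picked at random.
    """
    middle = [high, low]
    if rng is not None and rng.random() < 0.5:
        middle.reverse()
    prices = [open_, *middle, close]
    step = offset_ms * NANOS_PER_MILLI
    # Offsets count back from the close: 3, 2, 1, then 0 steps.
    return [
        SyntheticTick(price, ts_bar_ns - (3 - i) * step)
        for i, price in enumerate(prices)
    ]
```

Passing rng=random.Random(seed) keeps the shuffled ordering reproducible across backtest runs; leaving rng as None always emits open, high, low, close.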

Pseudocode:

FUNCTION wrangle_trade_ticks(instrument, dataframe, ts_init_delta, is_raw):
    VALIDATE dataframe is not empty
    CONVERT index to UTC
    COMPUTE ts_event, ts_init from index and delta

    IF is_raw:
        DESCALE price and quantity by FIXED_SCALAR

    INFER aggressor_side from 'side' or 'buyer_maker' column (or default NO_AGGRESSOR)

    FOR EACH row in (price, quantity, aggressor_side, trade_id, ts_event, ts_init):
        tick = TradeTick(
            instrument.id,
            Price(price, instrument.price_precision),
            Quantity(quantity, instrument.size_precision),
            aggressor_side,
            TradeId(trade_id),
            ts_event,
            ts_init,
        )
        APPEND tick to result

    RETURN result
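
The pseudocode can be turned into a small runnable sketch using only the standard library. Everything here is a stand-in: Instrument, TradeTick, and AggressorSide are simplified hypothetical classes rather than the NautilusTrader types, rows are plain dicts that already carry ts_event in nanoseconds, FIXED_SCALAR is assumed to be 10**9, and buyer_maker=True is assumed to mean that the seller was the aggressor.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum

FIXED_SCALAR = 10**9  # assumed fixed-point scale, for illustration only


class AggressorSide(Enum):
    NO_AGGRESSOR = 0
    BUYER = 1
    SELLER = 2


@dataclass(frozen=True)
class Instrument:
    id: str
    price_precision: int
    size_precision: int


@dataclass(frozen=True)
class TradeTick:
    instrument_id: str
    price: Decimal
    size: Decimal
    aggressor_side: AggressorSide
    trade_id: str
    ts_event: int
    ts_init: int


def quantize(value, precision: int) -> Decimal:
    """Round `value` to `precision` decimal places."""
    return Decimal(str(value)).quantize(Decimal(1).scaleb(-precision))


def infer_side(row: dict) -> AggressorSide:
    if "side" in row:
        is_buy = str(row["side"]).upper() == "BUY"
        return AggressorSide.BUYER if is_buy else AggressorSide.SELLER
    if "buyer_maker" in row:
        # Assumed convention: the buyer was the maker, so the seller aggressed.
        return AggressorSide.SELLER if row["buyer_maker"] else AggressorSide.BUYER
    return AggressorSide.NO_AGGRESSOR


def wrangle_trade_ticks(instrument, rows, ts_init_delta=0, is_raw=False):
    if not rows:
        raise ValueError("rows must not be empty")
    ticks = []
    for row in rows:
        price, size = row["price"], row["quantity"]
        if is_raw:
            # Reverse the fixed-point scaling before applying precision.
            price = Decimal(price) / FIXED_SCALAR
            size = Decimal(size) / FIXED_SCALAR
        ticks.append(
            TradeTick(
                instrument.id,
                quantize(price, instrument.price_precision),
                quantize(size, instrument.size_precision),
                infer_side(row),
                str(row["trade_id"]),
                row["ts_event"],
                row["ts_event"] + ts_init_delta,
            )
        )
    return ticks
```

The sketch loops row by row for readability; it does not attempt the vectorized, Cython-level construction that the principle describes.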
