Heuristic: nautechsystems/nautilus_trader Streaming Mode for Large Backtests
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Backtesting |
| Last Updated | 2026-02-10 08:30 GMT |
Overview
Memory optimization technique using streaming mode with data generators to process backtest datasets larger than available memory.
Description
The `BacktestEngine.run()` method supports a `streaming` mode that allows processing datasets that exceed available RAM. Instead of loading all historical data at once, data is loaded in batches via Python generators. The engine processes each batch, then clears it from memory before loading the next. This is particularly important for multi-instrument or tick-level backtests spanning long time periods.
The `add_data_iterator()` method enables dynamic streaming by registering named generators that yield data chunks on demand.
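The kind of chunked generator that `add_data_iterator()` consumes can be sketched in plain Python. This is an illustrative stand-in: the chunk elements here are integers rather than real NautilusTrader tick objects, and the loading logic is simplified.

```python
# Illustrative sketch of a chunked data generator, of the kind registered
# via `add_data_iterator()`. Chunk contents are simplified stand-ins,
# not real NautilusTrader tick objects.
from typing import Iterator, List


def tick_chunks(total_ticks: int, chunk_size: int) -> Iterator[List[int]]:
    """Yield tick batches lazily so only one chunk is in memory at a time."""
    for start in range(0, total_ticks, chunk_size):
        # In a real backtest each element would be a QuoteTick/TradeTick
        # loaded from disk (e.g. a Parquet file) for this time window.
        yield list(range(start, min(start + chunk_size, total_ticks)))


# The engine pulls chunks on demand; here we simply drain the generator.
chunks = list(tick_chunks(total_ticks=10, chunk_size=4))
```

Because the generator yields one chunk at a time, peak memory is bounded by the chunk size rather than the full dataset.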
Usage
Use this heuristic when you are running backtests with large datasets that cause out-of-memory errors or excessive swap usage. Specifically:
- Tick-level backtests spanning months or years
- Multi-instrument backtests with many concurrent data streams
- Running on machines with limited RAM (e.g., < 16GB)
- Using the `BacktestEngine` directly (not `BacktestNode`)
The Insight (Rule of Thumb)
- Action: Use `streaming=True` in `BacktestEngine.run()` and load data in batches.
- Sequence:
- Add initial data batch and strategies
- Call `run(streaming=True)` to process the batch
- Call `clear_data()` to free memory
- Add the next batch of data
- Call `run(streaming=False)` or `end()` for the final batch
- Alternative: Use `add_data_iterator()` to register generators that yield data chunks on demand (preferred for new code).
- Trade-off: Streaming mode does not currently support custom data types (e.g., option Greeks). Only standard NautilusTrader data types work in streaming mode.
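The sequence above can be sketched as a loop. A stub engine stands in for `BacktestEngine` so the control flow is visible without a full data setup; the method names (`add_data`, `run`, `clear_data`) mirror the documented API, but everything else is illustrative.

```python
# Sketch of the batched streaming loop. `StubEngine` records calls in place
# of a real BacktestEngine; only the method names mirror the documented API.
class StubEngine:
    def __init__(self):
        self.calls = []

    def add_data(self, batch):
        self.calls.append(("add_data", len(batch)))

    def run(self, streaming):
        self.calls.append(("run", streaming))

    def clear_data(self):
        self.calls.append(("clear_data", None))


def run_streaming_backtest(engine, batches):
    """Feed batches one at a time, clearing each before loading the next."""
    last = len(batches) - 1
    for i, batch in enumerate(batches):
        engine.add_data(batch)
        if i < last:
            engine.run(streaming=True)   # process this batch, keep engine state
            engine.clear_data()          # free the batch's memory
        else:
            engine.run(streaming=False)  # final batch: run to completion


engine = StubEngine()
run_streaming_backtest(engine, [[1, 2], [3, 4], [5]])
```

The key invariant is that `clear_data()` runs between batches but never after the final `run(streaming=False)`, which closes out the backtest.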
Reasoning
Large backtests can easily consume tens of gigabytes of memory when all tick data is loaded at once. For example, a year of tick data for a single liquid futures instrument can be 5-10GB. Loading multiple instruments simultaneously quickly exceeds typical machine memory. Streaming mode solves this by processing data incrementally, keeping only the active batch in memory.
The `DataIterator` class manages time-ordered multiplexing across multiple data streams, supporting both static data lists and dynamic generators. When using generators, data chunks are fetched on demand and integrated into the time-ordered event stream.
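The time-ordered multiplexing that `DataIterator` performs can be illustrated with `heapq.merge`: several lazy streams are merged into one event sequence ordered by timestamp, pulling from each generator only as events are consumed. This is a conceptual stand-in, not the actual implementation, and the instrument names are hypothetical.

```python
# Conceptual illustration of time-ordered multiplexing across lazy streams,
# in the spirit of DataIterator (not the actual implementation).
import heapq
from typing import Iterator, List, NamedTuple


class Event(NamedTuple):
    ts_init: int   # nanosecond timestamp, as on NautilusTrader data objects
    stream: str


def stream(name: str, timestamps: List[int]) -> Iterator[Event]:
    """A lazy per-instrument stream; a real one would read chunks from disk."""
    for ts in timestamps:
        yield Event(ts, name)


# Merge multiple lazy streams into a single time-ordered event sequence.
merged = heapq.merge(
    stream("EURUSD", [1, 4, 6]),
    stream("GBPUSD", [2, 3, 5]),
    key=lambda e: e.ts_init,
)
ordered = [e.ts_init for e in merged]
```

Because `heapq.merge` is itself lazy, the merged stream never materializes more than one pending event per input stream, which is what makes generator-backed streaming memory-efficient.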
Code Evidence
Streaming mode documentation from `backtest/engine.pyx:1305-1312`:
# For datasets larger than available memory, use `streaming` mode with the
# following sequence:
# - 1. Add initial data batch and strategies
# - 2. Call `run(streaming=True)`
# - 3. Call `clear_data()`
# - 4. Add next batch of data stream
# - 5. Call `run(streaming=False)` or `end()` when processing the final batch
Streaming parameter behavior from `backtest/engine.pyx:1327-1330`:
# Controls data loading and processing mode:
# - If False (default): Loads all data at once.
# This is currently the only supported mode for custom data.
# - If True, loads data in chunks for memory-efficient processing.
Generator streaming from `backtest/engine.pyx:2236-2239`:
# This method enables memory-efficient processing of large datasets by using
# Python generators that yield data chunks on-demand. The generator is called
# incrementally as data is consumed, allowing datasets larger than available
# memory to be processed.