
Heuristic: Mage AI Batch Size Tuning

From Leeroopedia



Knowledge Sources
Domains: Optimization, Data_Integration
Last Updated: 2026-02-09 07:00 GMT

Overview

Two-tier batch size strategy balancing memory usage and throughput: 50K record fetch limit for sources and 100 MB byte-size cap for destinations.

Description

The mage-integrations framework uses a layered batch size strategy. Sources use a configurable row-count limit (default 50K per batch, 10K per sub-batch) to control memory during extraction. Destinations use a byte-size limit (default 100 MB) to control memory during loading. Individual destination connectors further refine this with connector-specific record-count limits (e.g., 1000 for SQL, 500 for BigQuery). These defaults represent empirically tuned values that balance throughput against memory pressure.

Usage

Apply this heuristic when:

  • Tuning extraction performance: Adjust `batch_fetch_limit` and `subbatch_fetch_limit` in source config to control memory usage during SQL or API extraction.
  • Experiencing OOM errors: Reduce batch limits if the destination runs out of memory during batch processing.
  • Optimizing throughput: Increase batch limits on high-memory machines to reduce the number of round-trips.
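As a minimal sketch of the two tuning directions above, here are two hypothetical source config fragments. The key names (`batch_fetch_limit`, `subbatch_fetch_limit`) come from the constants documented below; the surrounding config shape and the specific values are illustrative assumptions, not recommendations.

```python
# Hypothetical source config fragments; key names match sources/constants.py,
# values are illustrative only.
low_memory_config = {
    "batch_fetch_limit": 10_000,    # fewer rows per batch -> lower peak memory
    "subbatch_fetch_limit": 2_000,  # smaller LIMIT/OFFSET pages
}

high_throughput_config = {
    "batch_fetch_limit": 200_000,   # more rows per batch -> fewer round-trips
    "subbatch_fetch_limit": 50_000,
}
```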

The Insight (Rule of Thumb)

  • Source Fetch: Default `BATCH_FETCH_LIMIT = 50000` rows per batch, `SUBBATCH_FETCH_LIMIT = 10000` rows per sub-batch. Configurable via `batch_fetch_limit` and `subbatch_fetch_limit` keys in connector config.
  • Destination Byte Cap: Default `MAXIMUM_BATCH_SIZE_MB = 100` MB. When accumulated batch data exceeds this, the framework flushes immediately. Configurable via `maximum_batch_size_mb` in destination config.
  • SQL Destination: Default `BATCH_SIZE = 1000` records per INSERT batch.
  • BigQuery Destination: Default `BATCH_SIZE = 500` records per batch (more conservative due to BigQuery API limits).
  • Trade-off: Larger batches = fewer round-trips but higher peak memory. Smaller batches = lower memory but more overhead.
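A quick back-of-the-envelope check shows how the two tiers interact: whether a full 50K-row source batch fits under the 100 MB destination cap depends entirely on average record size. The helper name and the example record sizes here are illustrative assumptions.

```python
# Which tier triggers first? 50K rows vs. the 100 MB byte cap, for a given
# average serialized record size (illustrative arithmetic only).
BATCH_FETCH_LIMIT = 50_000
MAXIMUM_BATCH_SIZE_MB = 100

def fits_in_byte_cap(avg_record_bytes: int) -> bool:
    """True if a full row-count batch stays under the destination byte cap."""
    return BATCH_FETCH_LIMIT * avg_record_bytes < MAXIMUM_BATCH_SIZE_MB * 1024 * 1024

print(fits_in_byte_cap(1024))  # ~1 KB records: 50K * 1 KB ~= 51 MB, under the cap
print(fits_in_byte_cap(4096))  # ~4 KB nested JSON: ~205 MB, byte cap flushes first
```

This is exactly why the byte cap acts as a safety valve: wide or nested records hit the 100 MB limit long before the 50K row limit.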

Reasoning

ETL pipelines process data in batches to avoid loading entire datasets into memory. The two-tier approach (row-count for sources, byte-size for destinations) accounts for the fact that record sizes vary widely across streams. A stream with 10 columns and small strings has very different memory characteristics than one with nested JSON objects. The byte-size cap at the destination layer acts as a universal safety valve regardless of record shape.

The sub-batch limit (10K) exists because SQL sources paginate via LIMIT/OFFSET. Smaller pages reduce database lock contention and allow more frequent state checkpointing for incremental syncs.
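The pagination described above can be sketched as follows. The query shape is a simplified stand-in for whatever SQL the real connectors emit; only the two limits themselves come from the framework's constants.

```python
# Sketch: paging one 50K batch in 10K sub-batches via LIMIT/OFFSET.
# The SELECT shape is illustrative, not the actual connector implementation.
BATCH_FETCH_LIMIT = 50_000
SUBBATCH_FETCH_LIMIT = 10_000

def subbatch_queries(table: str):
    offset = 0
    while offset < BATCH_FETCH_LIMIT:
        yield f"SELECT * FROM {table} LIMIT {SUBBATCH_FETCH_LIMIT} OFFSET {offset}"
        offset += SUBBATCH_FETCH_LIMIT

queries = list(subbatch_queries("users"))
print(len(queries))  # 5 pages of 10K rows each
```

Each completed page is a natural point to checkpoint incremental-sync state, which is why smaller pages mean more frequent checkpoints.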

BigQuery uses 500 instead of 1000 because its API has strict request size limits (1 MB query string, 10 MB parameters).
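Record-count batching at the destination reduces to simple chunking. The `BATCH_SIZE` value below is the BigQuery destination's default; the helper function is a hypothetical illustration, not the framework's actual code.

```python
# Generic record-count chunking as a destination might apply it.
# BATCH_SIZE matches the BigQuery destination default; chunk_records is
# a hypothetical helper for illustration.
BATCH_SIZE = 500

def chunk_records(records, batch_size=BATCH_SIZE):
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

batches = list(chunk_records(list(range(1_200))))
print([len(b) for b in batches])  # [500, 500, 200]
```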

Code Evidence

Source batch constants from `sources/constants.py:43-46`:

BATCH_FETCH_LIMIT = 50000
SUBBATCH_FETCH_LIMIT = 10000
BATCH_FETCH_LIMIT_KEY = 'batch_fetch_limit'
SUBBATCH_FETCH_LIMIT_KEY = 'subbatch_fetch_limit'

Destination byte-size cap from `destinations/base.py:50,499-500`:

MAXIMUM_BATCH_SIZE_MB = 100

if current_byte_size >= self.config.get(
        'maximum_batch_size_mb', MAXIMUM_BATCH_SIZE_MB) * 1024 * 1024:
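The flush condition in that excerpt can be sketched end to end as an accumulating buffer. This is a minimal illustration of the logic, not the framework's implementation; the class name is hypothetical, `flush()` stands in for the real write path, and JSON serialization is an assumed way to measure record size.

```python
# Minimal sketch of byte-capped buffering: accumulate records, flush once
# serialized size crosses the configured cap. Class and flush path are
# hypothetical stand-ins for the real destination code.
import json

MAXIMUM_BATCH_SIZE_MB = 100

class ByteCappedBuffer:
    def __init__(self, config=None):
        self.config = config or {}
        self.records = []
        self.current_byte_size = 0
        self.flushes = 0

    def add(self, record: dict):
        self.records.append(record)
        self.current_byte_size += len(json.dumps(record).encode("utf-8"))
        cap = self.config.get("maximum_batch_size_mb", MAXIMUM_BATCH_SIZE_MB)
        if self.current_byte_size >= cap * 1024 * 1024:
            self.flush()

    def flush(self):
        # A real destination would load self.records here before clearing.
        self.records = []
        self.current_byte_size = 0
        self.flushes += 1

buf = ByteCappedBuffer({"maximum_batch_size_mb": 0.001})  # ~1 KB cap for demo
for i in range(100):
    buf.add({"id": i, "payload": "x" * 100})
print(buf.flushes > 0)  # True: the tiny cap forces multiple mid-stream flushes
```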

SQL destination batch size from `destinations/sql/base.py:23`:

BATCH_SIZE = 1000
