
Heuristic: DataHub Batch Size and Timeout Tuning

From Leeroopedia





Knowledge Sources
Domains Optimization, Configuration, Metadata_Ingestion
Last Updated 2026-02-10 00:00 GMT

Overview

Configuration guidance for batch sizes, retry counts, and timeouts across DataHub CLI operations, the REST emitter, and the sink, to avoid server-side timeouts and optimize throughput.

Description

DataHub exposes multiple tuning knobs for controlling batch sizes, retry behavior, and timeouts across its CLI, REST emitter, and sink components. These defaults are carefully chosen to balance throughput against server stability. Exceeding these limits (especially batch sizes) can cause server-side timeouts, while setting them too low sacrifices throughput. The defaults are encoded as environment variables with sensible fallbacks, but production deployments often need adjustment based on network latency, GMS server capacity, and dataset size.

Usage

Use this heuristic when tuning DataHub for production ingestion at scale, when experiencing timeout errors during bulk operations (delete, ingest), or when optimizing throughput for large metadata emissions. Also relevant when configuring the REST emitter for high-volume programmatic use.

The Insight (Rule of Thumb)

  • Action: Set batch sizes within documented limits; never exceed 5000 for delete operations.
  • Value:
    • Delete batch size: default `3000`, maximum `5000` (enforced in CLI)
    • REST emitter max payload: `15MB` per batch (default `DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_BYTES`)
    • REST emitter max MCPs per batch: `200` (default `DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_LENGTH`)
    • REST emitter retry max: `4` attempts (default `DATAHUB_REST_EMITTER_DEFAULT_RETRY_MAX_TIMES`)
    • REST emitter 429 retry multiplier: `2` (default `DATAHUB_REST_EMITTER_429_RETRY_MULTIPLIER`)
    • REST sink max threads: `15` (default `DATAHUB_REST_SINK_DEFAULT_MAX_THREADS`)
    • Actions pipeline retry: `0` (default, no retries unless explicitly configured)
  • Trade-off: Larger batches increase throughput but risk timeouts; more retries add resilience but increase latency on failures.

Reasoning

The delete CLI enforces a hard maximum of 5000 entities per batch because GMS processes deletions synchronously, and larger batches exceed the server's request timeout window. The default of 3000 provides a safety margin. The REST emitter's 15MB payload limit prevents HTTP request body size violations at the load balancer or GMS level. The 200 MCP limit per batch keeps individual request processing time predictable. The 429 retry multiplier of 2 scales the retry budget specifically for rate-limited responses, effectively allowing `4 * 2 = 8` total attempts before a rate-limited request is abandoned.
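
That per-status retry budget can be sketched roughly as follows, assuming the multiplier simply scales the retry count for 429 responses. This is an illustrative helper, not the emitter's actual code path.

```python
def max_retries_for(
    status_code: int,
    retry_max_times: int = 4,   # DATAHUB_REST_EMITTER_DEFAULT_RETRY_MAX_TIMES
    multiplier_429: int = 2,    # DATAHUB_REST_EMITTER_429_RETRY_MULTIPLIER
) -> int:
    """Retry budget per response class: rate-limited (429) requests
    get an extended budget relative to other retryable failures."""
    if status_code == 429:
        return retry_max_times * multiplier_429  # 4 * 2 = 8 attempts
    return retry_max_times
```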

The Actions framework defaults to 0 retries because action handlers are expected to be idempotent, and automatic retries could cause duplicate side effects (e.g., sending duplicate Slack notifications).
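
A minimal sketch of the idempotency expectation, using a hypothetical dedup-by-event-id handler (none of these names are DataHub APIs): an idempotent handler makes retries or redeliveries harmless, which is why zero automatic retries is a safe default only when handlers follow this pattern.

```python
# Hypothetical action handler: deduplicate by event id so a retry
# (or redelivery) cannot send the same notification twice.
_sent_notifications: set[str] = set()

def notify_once(event_id: str, message: str) -> bool:
    """Idempotent handler sketch: returns True only on first delivery."""
    if event_id in _sent_notifications:
        return False  # already handled; safe to skip on retry
    _sent_notifications.add(event_id)
    print(f"notify: {message}")
    return True
```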

Code Evidence

Delete batch size limit from `delete_cli.py:229-234`:

# Maximum batch size is 5000. Large batch sizes may cause timeouts.
# Default is 3000
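
On the caller side, a large URN list can be split to stay under that limit. A sketch; `chunked` is an illustrative helper, not part of the CLI.

```python
from typing import Iterator

def chunked(urns: list[str], batch_size: int = 3000) -> Iterator[list[str]]:
    """Yield delete batches no larger than batch_size (default 3000,
    matching the CLI default; never exceed 5000)."""
    for start in range(0, len(urns), batch_size):
        yield urns[start : start + batch_size]
```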

REST emitter configuration from `env_vars.py:72-104`:

import os

def get_rest_emitter_default_retry_max_times() -> str:
    """Max retry attempts for failed requests."""
    return os.getenv("DATAHUB_REST_EMITTER_DEFAULT_RETRY_MAX_TIMES", "4")

def get_rest_emitter_429_retry_multiplier() -> int:
    """Multiplier for 429 retry backoff."""
    return int(os.getenv("DATAHUB_REST_EMITTER_429_RETRY_MULTIPLIER", "2"))

Actions default retry from `pipeline.py:38`:

DEFAULT_RETRY_COUNT = 0  # do not retry unless instructed
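
Production overrides use the documented environment variable names. A minimal sketch (the override values here are illustrative, e.g. for a high-latency network):

```python
import os

# Raise the retry budget and shrink batches for a high-latency link.
os.environ["DATAHUB_REST_EMITTER_DEFAULT_RETRY_MAX_TIMES"] = "6"
os.environ["DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_LENGTH"] = "100"

# Readers fall back to the documented defaults when unset.
retry_max = int(os.getenv("DATAHUB_REST_EMITTER_DEFAULT_RETRY_MAX_TIMES", "4"))
batch_len = int(os.getenv("DATAHUB_REST_EMITTER_BATCH_MAX_PAYLOAD_LENGTH", "200"))
```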
