Principle:Datahub project Datahub Batch Ingestion Execution

Field	Value
Principle Name	Batch Ingestion Execution
Overview	The orchestrated process of extracting metadata from sources, transforming it, and loading it into DataHub via a pipeline execution model.
Status	Active
Domains	Data_Integration, Metadata_Management
Related Implementations	Datahub_project_Datahub_Ingest_CLI_Run
Last Updated	2026-02-10
Knowledge Sources	DataHub Repository

Description

Batch ingestion execution is DataHub's ETL pipeline for metadata. It coordinates source extraction, optional transformation, and sink emission. The pipeline manages the full lifecycle including initialization, execution, error handling, reporting, and graceful shutdown.

The execution flow proceeds through these stages:

Initialization -- The Pipeline class parses the recipe configuration, instantiates the source, sink, transformers, and extractor, and establishes a connection to the DataHub GMS server.
Extraction -- The source's get_workunits() method produces a stream of MetadataWorkUnit objects, each representing a metadata change proposal.
Record extraction -- The extractor converts work units into RecordEnvelope objects via get_records().
Transformation -- Each record envelope passes through the configured transformer chain. Transformers can modify, enrich, or filter metadata records.
Loading -- The sink receives transformed records via write_record_async() and emits them to the DataHub backend (REST API, Kafka, or file).
Completion -- The pipeline commits stateful ingestion state, reports results, and performs cleanup.

The pipeline supports several execution modes:

Normal mode -- Full extraction, transformation, and loading
Dry run mode -- Extraction and transformation without writing to the sink
Preview mode -- Limited extraction (configurable number of work units) for quick validation
Test source connection -- Only tests whether the source can be connected to

Usage

Use batch ingestion execution when running a one-time or on-demand metadata ingestion from any supported source to DataHub. This is the primary mechanism for populating DataHub with metadata.

Theoretical Basis

Batch ingestion execution follows the ETL (Extract-Transform-Load) pipeline pattern. The pipeline orchestrates a streaming flow:

Source.get_workunits() --> Extractor.get_records() --> Transformer.transform() --> Sink.write_record_async()

Key design principles:

Streaming execution -- Work units are processed one at a time in a stream, avoiding the need to load all metadata into memory at once
Error isolation -- Failures in individual work units are captured and reported without terminating the entire pipeline
Composable transformers -- Transformers form a chain where each transformer's output feeds the next
Stateful ingestion -- The source can maintain state across runs (via checkpointing) to enable incremental extraction
Graceful shutdown -- The pipeline handles SystemExit and KeyboardInterrupt to clean up resources

Pipeline Lifecycle

Stage	Status	Description
Initialization	UNKNOWN	Pipeline components are created and configured
Running	UNKNOWN	Work units are being processed
Completed	COMPLETED	All work units processed successfully
Error	ERROR	An uncaught exception occurred during execution
Cancelled	CANCELLED	The pipeline was interrupted by the user or system

Constraints

The source must be a registered plugin that implements the Source interface
The sink must implement the Sink interface with write_record_async()
Transformers must implement the Transformer interface with transform()
Memory usage is proportional to the size of individual work units, not the total dataset
The pipeline reports progress at 60-second intervals during execution

Related Pages

Implemented by: Datahub_project_Datahub_Ingest_CLI_Run

Implementation:Datahub_project_Datahub_Ingest_CLI_Run

Related: Datahub_project_Datahub_Recipe_Configuration
Related: Datahub_project_Datahub_PipelineConfig
Related: Datahub_project_Datahub_Scheduled_Ingestion

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment