Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Datahub project Datahub Batch Ingestion Execution

From Leeroopedia


Field Value
Principle Name Batch Ingestion Execution
Overview The orchestrated process of extracting metadata from sources, transforming it, and loading it into DataHub via a pipeline execution model.
Status Active
Domains Data_Integration, Metadata_Management
Related Implementations Datahub_project_Datahub_Ingest_CLI_Run
Last Updated 2026-02-10
Knowledge Sources DataHub Repository

Description

Batch ingestion execution is DataHub's ETL pipeline for metadata. It coordinates source extraction, optional transformation, and sink emission. The pipeline manages the full lifecycle including initialization, execution, error handling, reporting, and graceful shutdown.

The execution flow proceeds through these stages:

  1. Initialization -- The Pipeline class parses the recipe configuration, instantiates the source, sink, transformers, and extractor, and establishes a connection to the DataHub GMS server.
  2. Extraction -- The source's get_workunits() method produces a stream of MetadataWorkUnit objects, each representing a metadata change proposal.
  3. Record extraction -- The extractor converts work units into RecordEnvelope objects via get_records().
  4. Transformation -- Each record envelope passes through the configured transformer chain. Transformers can modify, enrich, or filter metadata records.
  5. Loading -- The sink receives transformed records via write_record_async() and emits them to the DataHub backend (REST API, Kafka, or file).
  6. Completion -- The pipeline commits stateful ingestion state, reports results, and performs cleanup.

The pipeline supports several execution modes:

  • Normal mode -- Full extraction, transformation, and loading
  • Dry run mode -- Extraction and transformation without writing to the sink
  • Preview mode -- Limited extraction (configurable number of work units) for quick validation
  • Test source connection -- Only tests whether the source can be connected to

Usage

Use batch ingestion execution when running a one-time or on-demand metadata ingestion from any supported source to DataHub. This is the primary mechanism for populating DataHub with metadata.

Theoretical Basis

Batch ingestion execution follows the ETL (Extract-Transform-Load) pipeline pattern. The pipeline orchestrates a streaming flow:

Source.get_workunits() --> Extractor.get_records() --> Transformer.transform() --> Sink.write_record_async()

Key design principles:

  • Streaming execution -- Work units are processed one at a time in a stream, avoiding the need to load all metadata into memory at once
  • Error isolation -- Failures in individual work units are captured and reported without terminating the entire pipeline
  • Composable transformers -- Transformers form a chain where each transformer's output feeds the next
  • Stateful ingestion -- The source can maintain state across runs (via checkpointing) to enable incremental extraction
  • Graceful shutdown -- The pipeline handles SystemExit and KeyboardInterrupt to clean up resources

Pipeline Lifecycle

Stage Status Description
Initialization UNKNOWN Pipeline components are created and configured
Running UNKNOWN Work units are being processed
Completed COMPLETED All work units processed successfully
Error ERROR An uncaught exception occurred during execution
Cancelled CANCELLED The pipeline was interrupted by the user or system

Constraints

  • The source must be a registered plugin that implements the Source interface
  • The sink must implement the Sink interface with write_record_async()
  • Transformers must implement the Transformer interface with transform()
  • Memory usage is proportional to the size of individual work units, not the total dataset
  • The pipeline reports progress at 60-second intervals during execution

Related Pages

Implementation:Datahub_project_Datahub_Ingest_CLI_Run

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment