Principle:Datahub project Datahub Batch Ingestion Execution
| Field | Value |
|---|---|
| Principle Name | Batch Ingestion Execution |
| Overview | The orchestrated process of extracting metadata from sources, transforming it, and loading it into DataHub via a pipeline execution model. |
| Status | Active |
| Domains | Data_Integration, Metadata_Management |
| Related Implementations | Datahub_project_Datahub_Ingest_CLI_Run |
| Last Updated | 2026-02-10 |
| Knowledge Sources | DataHub Repository |
Description
Batch ingestion execution is DataHub's ETL pipeline for metadata. It coordinates source extraction, optional transformation, and sink emission. The pipeline manages the full lifecycle including initialization, execution, error handling, reporting, and graceful shutdown.
The execution flow proceeds through these stages:
- Initialization -- The
Pipelineclass parses the recipe configuration, instantiates the source, sink, transformers, and extractor, and establishes a connection to the DataHub GMS server. - Extraction -- The source's
get_workunits()method produces a stream ofMetadataWorkUnitobjects, each representing a metadata change proposal. - Record extraction -- The extractor converts work units into
RecordEnvelopeobjects viaget_records(). - Transformation -- Each record envelope passes through the configured transformer chain. Transformers can modify, enrich, or filter metadata records.
- Loading -- The sink receives transformed records via
write_record_async()and emits them to the DataHub backend (REST API, Kafka, or file). - Completion -- The pipeline commits stateful ingestion state, reports results, and performs cleanup.
The pipeline supports several execution modes:
- Normal mode -- Full extraction, transformation, and loading
- Dry run mode -- Extraction and transformation without writing to the sink
- Preview mode -- Limited extraction (configurable number of work units) for quick validation
- Test source connection -- Only tests whether the source can be connected to
Usage
Use batch ingestion execution when running a one-time or on-demand metadata ingestion from any supported source to DataHub. This is the primary mechanism for populating DataHub with metadata.
Theoretical Basis
Batch ingestion execution follows the ETL (Extract-Transform-Load) pipeline pattern. The pipeline orchestrates a streaming flow:
Source.get_workunits() --> Extractor.get_records() --> Transformer.transform() --> Sink.write_record_async()
Key design principles:
- Streaming execution -- Work units are processed one at a time in a stream, avoiding the need to load all metadata into memory at once
- Error isolation -- Failures in individual work units are captured and reported without terminating the entire pipeline
- Composable transformers -- Transformers form a chain where each transformer's output feeds the next
- Stateful ingestion -- The source can maintain state across runs (via checkpointing) to enable incremental extraction
- Graceful shutdown -- The pipeline handles
SystemExitandKeyboardInterruptto clean up resources
Pipeline Lifecycle
| Stage | Status | Description |
|---|---|---|
| Initialization | UNKNOWN | Pipeline components are created and configured |
| Running | UNKNOWN | Work units are being processed |
| Completed | COMPLETED | All work units processed successfully |
| Error | ERROR | An uncaught exception occurred during execution |
| Cancelled | CANCELLED | The pipeline was interrupted by the user or system |
Constraints
- The source must be a registered plugin that implements the
Sourceinterface - The sink must implement the
Sinkinterface withwrite_record_async() - Transformers must implement the
Transformerinterface withtransform() - Memory usage is proportional to the size of individual work units, not the total dataset
- The pipeline reports progress at 60-second intervals during execution
Related Pages
- Implemented by: Datahub_project_Datahub_Ingest_CLI_Run