Principle:Datahub project Datahub Scheduled Ingestion

Field	Value
Principle Name	Scheduled Ingestion
Overview	The practice of automating recurring metadata ingestion runs using external scheduling systems.
Status	Active
Domains	Data_Integration, Metadata_Management
Related Implementations	Datahub_project_Datahub_Scheduled_Ingestion_Orchestration
Last Updated	2026-02-10
Knowledge Sources	DataHub Repository

Description

Scheduled ingestion keeps metadata fresh by periodically re-running ingestion pipelines. Because the DataHub CLI is an ordinary command-line program, any scheduler that can launch a process can run it (cron, Airflow, CI/CD systems, Kubernetes CronJobs). Each run produces an independent execution report, allowing operators to monitor ingestion health over time.

The scheduling responsibility is deliberately kept external to DataHub's CLI. The datahub ingest run command is designed as a stateless, idempotent operation that can be safely invoked repeatedly. This separation of concerns means:

The CLI handles execution -- parsing the recipe, running the pipeline, reporting results
The scheduler handles timing -- when and how often to run, retries on failure, alerting
DataHub handles state -- stateful ingestion tracks what has changed between runs via checkpointing
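The split above can be sketched as a thin wrapper that a scheduler calls once per interval. This is a minimal sketch, not DataHub's own orchestration code: `build_ingest_command` and `run_once` are hypothetical helper names, and the `-c` flag for the recipe path and the non-zero exit code on failure are assumptions worth confirming with `datahub ingest run --help` for the installed version.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def build_ingest_command(recipe_path: str) -> List[str]:
    """Return the argv for one ingestion run of the given recipe."""
    # Assumption: the recipe path is passed with -c.
    return ["datahub", "ingest", "run", "-c", recipe_path]


def run_once(
    recipe_path: str,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
) -> bool:
    """Execute a single run and report success.

    Timing, retries, and alerting are left to the scheduler that calls this.
    """
    if runner is None:
        # Default runner: spawn the CLI and hand back its exit code.
        runner = lambda argv: subprocess.run(argv, check=False).returncode
    # Assumption: the CLI exits with 0 on success and non-zero on failure.
    return runner(build_ingest_command(recipe_path)) == 0
```

The `runner` parameter exists only so the wrapper can be exercised without a DataHub installation; a scheduler would normally call `run_once("recipe.yml")` and treat a `False` result as a failed task.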

DataHub also provides a built-in scheduling mechanism via the datahub ingest deploy command, which registers a recipe with the DataHub server for server-side scheduled execution. This approach uses the server's built-in executor and scheduler, configured via cron expressions.
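As an illustration of the server-side option, the sketch below assembles a `datahub ingest deploy` invocation from a name, a recipe, and a cron expression. The helper `build_deploy_command` is hypothetical, and the flag names (`--name`, `--schedule`, `--time-zone`, `-c`) are assumptions to be checked against `datahub ingest deploy --help`. The five-field check is only a sanity test, not a full cron parser.

```python
from typing import List


def build_deploy_command(
    name: str,
    recipe_path: str,
    schedule: str,
    time_zone: str = "UTC",
) -> List[str]:
    """Return the argv that registers a recipe for server-side scheduling."""
    # Sanity check only: a standard cron expression has five fields.
    if len(schedule.split()) != 5:
        raise ValueError(f"expected a 5-field cron expression, got {schedule!r}")
    # Assumption: these flag names match the installed CLI version.
    return [
        "datahub", "ingest", "deploy",
        "--name", name,
        "--schedule", schedule,
        "--time-zone", time_zone,
        "-c", recipe_path,
    ]
```

For example, `build_deploy_command("nightly-postgres", "postgres.yml", "0 2 * * *")` would describe a deployment that the server runs every night at 02:00 UTC.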

Usage

Use scheduled ingestion when metadata sources change frequently and need regular synchronization with DataHub. Common scenarios include:

Nightly schema sync -- Ingest database schemas every night to capture DDL changes
Hourly usage stats -- Collect query usage statistics from data warehouses
Continuous lineage tracking -- Periodically extract lineage from ETL tools
Dashboard metadata refresh -- Keep BI tool metadata up to date

Theoretical Basis

Scheduled ingestion relies on an external orchestrator invoking the same idempotent CLI command at defined intervals. Each run is designed to be:

Stateless -- Each run reads the current state of the source and emits it to DataHub; no local state persists between runs
Idempotent -- Running the same recipe multiple times against an unchanged source produces the same result (metadata is upserted, not duplicated)
Observable -- Each run generates a structured report with success/failure status, enabling monitoring and alerting

The combination of stateless execution and idempotent writes means that a missed or failed run simply results in slightly stale metadata, which is corrected by the next successful run.
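A toy model makes the argument concrete. The sketch below treats DataHub as a dictionary keyed by URN and an ingestion run as an upsert of whatever the source currently contains. The `upsert` function and the sample records are illustrative assumptions, not DataHub's storage model.

```python
from typing import Dict


def upsert(store: Dict[str, dict], records: Dict[str, dict]) -> Dict[str, dict]:
    """Return a new store in which each record replaces any entry with the same URN."""
    merged = dict(store)
    merged.update(records)
    return merged


orders = "urn:li:dataset:(urn:li:dataPlatform:postgres,db.public.orders,PROD)"
users = "urn:li:dataset:(urn:li:dataPlatform:postgres,db.public.users,PROD)"

# Source state at two points in time (hypothetical schemas).
monday = {orders: {"columns": ["id", "total"]}}
tuesday = {orders: {"columns": ["id", "total", "currency"]}, users: {"columns": ["id"]}}

# Idempotent: repeating Monday's run changes nothing.
after_one_run = upsert({}, monday)
after_two_runs = upsert(after_one_run, monday)

# Self-correcting: skipping Monday entirely still converges after Tuesday's run.
with_monday = upsert(after_two_runs, tuesday)
without_monday = upsert({}, tuesday)
```

Here `after_one_run == after_two_runs`, and `with_monday == without_monday`, which is the sense in which a missed run only delays freshness. Entities removed from the source are a separate matter, handled by stateful ingestion.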

For sources that support it, stateful ingestion optimizes scheduled runs by tracking the last-seen state via checkpoints. This enables:

Soft deletes -- Detecting entities that no longer exist in the source and marking them as removed
Incremental extraction -- Only processing changes since the last successful run
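The soft-delete idea can be sketched as a set difference between the URNs recorded in the previous checkpoint and the URNs seen in the current run. This is a simplified illustration under that assumption; `diff_against_checkpoint` is a hypothetical name and does not reflect DataHub's internal checkpoint format.

```python
from typing import Set, Tuple


def diff_against_checkpoint(
    previous: Set[str], current: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Compare the last checkpoint with the current run.

    Returns (stale, new): URNs to mark as removed, and URNs seen for the first time.
    """
    stale = previous - current   # in the last run, gone from the source now
    new = current - previous     # not in the last run, present now
    return stale, new


last_run = {"urn:a", "urn:b", "urn:c"}
this_run = {"urn:b", "urn:c", "urn:d"}
stale, new = diff_against_checkpoint(last_run, this_run)
# stale == {"urn:a"}, new == {"urn:d"}
```

After the run completes successfully, `this_run` would become the checkpoint for the next comparison. A failed run should leave the previous checkpoint in place so that a partial scan is not mistaken for mass deletion.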

Scheduling Approaches

Approach	Description	Best For
cron	Unix cron scheduler for simple periodic execution	Simple deployments, single-server setups
Airflow	Apache Airflow DAGs with BashOperator or PythonOperator	Complex workflows with dependencies and monitoring
CI/CD	GitHub Actions, GitLab CI, Jenkins scheduled pipelines	GitOps workflows, recipe-as-code
Kubernetes CronJob	Kubernetes-native scheduling for containerized execution	Cloud-native deployments
DataHub Server	Built-in scheduling via `datahub ingest deploy` with cron expressions	Centralized management, no external scheduler needed
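For the simplest row in the table, cron, a schedule reduces to one crontab line per recipe. The sketch below renders such a line; `crontab_entry`, the paths, and the log redirection are illustrative assumptions, and the `-c` flag should be verified against the installed CLI.

```python
import shlex


def crontab_entry(schedule: str, recipe_path: str, log_path: str) -> str:
    """Render one crontab line that runs a recipe and appends output to a log."""
    # Quote each argument so paths with spaces survive the shell.
    command = " ".join(
        shlex.quote(part)
        for part in ["datahub", "ingest", "run", "-c", recipe_path]
    )
    return f"{schedule} {command} >> {shlex.quote(log_path)} 2>&1"


# Nightly schema sync at 02:00 (hypothetical paths).
print(crontab_entry(
    "0 2 * * *",
    "/etc/datahub/recipes/postgres.yml",
    "/var/log/datahub/postgres.log",
))
```

The other approaches in the table wrap the same command in their own unit of work: an Airflow task, a CI job, or a container in a Kubernetes CronJob.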

Constraints

The recipe file and all referenced credentials must be accessible at execution time
Environment variables used in recipes must be available in the scheduler's execution environment
Long-running ingestions may overlap with the next scheduled run; consider a lock that makes a new run exit while the previous one is still active, or a scheduler setting that runs jobs sequentially
Server-side scheduling (via datahub ingest deploy) requires a running DataHub executor
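Two of these constraints lend themselves to a pre-flight check in a wrapper script. The sketch below lists environment variables that a recipe references but the scheduler's environment lacks, and it uses an advisory file lock so an overlapping run fails fast. Both helpers are hypothetical; the `${VAR}` and `${VAR:-default}` reference syntax is an assumption about how recipes are written, and `fcntl` limits the lock to Unix-like systems.

```python
import fcntl
import re
from contextlib import contextmanager
from typing import Iterator, List, Mapping

# Assumption: recipes reference variables as ${NAME} or ${NAME:-default}.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(:-[^}]*)?\}")


def missing_env_vars(recipe_text: str, env: Mapping[str, str]) -> List[str]:
    """Return referenced variables that have no default and are absent from env."""
    missing: List[str] = []
    for match in _ENV_REF.finditer(recipe_text):
        name, default = match.group(1), match.group(2)
        if default is None and name not in env and name not in missing:
            missing.append(name)
    return missing


@contextmanager
def exclusive_run(lock_path: str) -> Iterator[None]:
    """Hold an exclusive advisory lock for the duration of one ingestion run.

    Raises BlockingIOError immediately if another run already holds the lock.
    """
    handle = open(lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        # Closing the file releases the lock if it was acquired.
        handle.close()
```

A wrapper would call `missing_env_vars(recipe_text, os.environ)` before starting and run the ingestion inside `with exclusive_run("/tmp/ingest.lock"):`. Schedulers with their own concurrency controls may make the lock unnecessary.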

Related Pages

Implemented by: Datahub_project_Datahub_Scheduled_Ingestion_Orchestration


Related: Datahub_project_Datahub_Batch_Ingestion_Execution
Related: Datahub_project_Datahub_Ingest_CLI_Run
Related: Datahub_project_Datahub_Recipe_Configuration
