Principle:Datahub project Datahub Scheduled Ingestion
| Field | Value |
|---|---|
| Principle Name | Scheduled Ingestion |
| Overview | The practice of automating recurring metadata ingestion runs using external scheduling systems. |
| Status | Active |
| Domains | Data_Integration, Metadata_Management |
| Related Implementations | Datahub_project_Datahub_Scheduled_Ingestion_Orchestration |
| Last Updated | 2026-02-10 |
| Knowledge Sources | DataHub Repository |
Description
Scheduled ingestion ensures metadata stays fresh by periodically re-running ingestion pipelines. Since DataHub's CLI is a standard Unix command, it integrates with any scheduler (Airflow, cron, CI/CD systems, Kubernetes CronJobs). Each run produces an independent execution report, allowing operators to monitor ingestion health over time.
The scheduling responsibility is deliberately kept external to DataHub's CLI. The datahub ingest run command is designed as a stateless, idempotent operation that can be safely invoked repeatedly. This separation of concerns means:
- The CLI handles execution -- parsing the recipe, running the pipeline, reporting results
- The scheduler handles timing -- when and how often to run, retries on failure, alerting
- DataHub handles state -- stateful ingestion tracks what has changed between runs via checkpointing
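The split between execution and timing can be sketched as a thin wrapper that a scheduler calls. This is a minimal illustration, not part of DataHub: the function names are hypothetical, the `-c` flag is assumed to be the recipe option of `datahub ingest run`, and the CLI is assumed to signal failure through a non-zero exit status.

```python
import subprocess
from typing import Callable, List


def build_ingest_command(recipe_path: str) -> List[str]:
    """Return the CLI invocation for one ingestion run of the given recipe."""
    # Assumption: `-c` points the CLI at the recipe file.
    return ["datahub", "ingest", "run", "-c", recipe_path]


def run_ingestion(
    recipe_path: str,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run one ingestion and return the process exit code.

    The wrapper makes no timing or retry decisions. The caller (cron,
    Airflow, a CI job) decides what a non-zero code means: retry, alert,
    or wait for the next scheduled run.
    """
    completed = runner(build_ingest_command(recipe_path), check=False)
    return completed.returncode
```

A scheduler would call `run_ingestion("/path/to/recipe.yml")` and act on the returned code; the `runner` parameter exists only so the wrapper can be exercised without the CLI installed.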
DataHub also provides a built-in scheduling mechanism via the datahub ingest deploy command, which registers a recipe with the DataHub server for server-side scheduled execution. This approach uses the server's built-in executor and scheduler, configured via cron expressions.
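As a rough sketch of server-side registration, the helper below assembles a `datahub ingest deploy` invocation. The helper name and its arguments are hypothetical, and the flags shown (`--name`, `--schedule`, `--time-zone`, `-c`) are assumptions that should be checked against `datahub ingest deploy --help` for the installed version.

```python
from typing import List


def build_deploy_command(
    name: str, recipe_path: str, cron: str, time_zone: str = "UTC"
) -> List[str]:
    """Assemble a deploy command for a five-field cron schedule."""
    # Basic sanity check only: a standard cron expression has five fields.
    if len(cron.split()) != 5:
        raise ValueError(f"expected a five-field cron expression, got: {cron!r}")
    # Assumption: these flag names match the installed CLI version.
    return [
        "datahub", "ingest", "deploy",
        "--name", name,
        "--schedule", cron,
        "--time-zone", time_zone,
        "-c", recipe_path,
    ]
```

The returned list can be passed to `subprocess.run` once the flags have been confirmed.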
Usage
Use scheduled ingestion when metadata sources change frequently and need regular synchronization with DataHub. Common scenarios include:
- Nightly schema sync -- Ingest database schemas every night to capture DDL changes
- Hourly usage stats -- Collect query usage statistics from data warehouses
- Continuous lineage tracking -- Periodically extract lineage from ETL tools
- Dashboard metadata refresh -- Keep BI tool metadata up to date
Theoretical Basis
Scheduled ingestion follows a simple pattern: an external orchestrator invokes the same idempotent CLI command at defined intervals. Each run is designed to be:
- Stateless -- Each run reads the current state of the source and emits it to DataHub; no local state persists between runs
- Idempotent -- Running the same recipe multiple times produces the same result (metadata is upserted, not duplicated)
- Observable -- Each run generates a structured report with success/failure status, enabling monitoring and alerting
The combination of stateless execution and idempotent writes means that a missed or failed run simply results in slightly stale metadata, which is corrected by the next successful run.
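The recovery property can be illustrated with a toy model in which the catalog is a dictionary and each run upserts the current source snapshot. The names and data below are hypothetical; this is not DataHub's write path.

```python
from typing import Dict

Catalog = Dict[str, dict]


def ingest_run(catalog: Catalog, source_snapshot: Dict[str, dict]) -> None:
    """Upsert every entity in the snapshot, keyed by its identifier."""
    for entity_id, metadata in source_snapshot.items():
        # Overwrite rather than append, so repeated runs do not duplicate.
        catalog[entity_id] = dict(metadata)


# Hypothetical source state on two consecutive days.
monday = {"orders": {"columns": ["id", "total"]}}
tuesday = {"orders": {"columns": ["id", "total", "currency"]}}

# Catalog A: Monday's run fires twice, then Tuesday's run succeeds.
catalog_a: Catalog = {}
ingest_run(catalog_a, monday)
ingest_run(catalog_a, monday)  # idempotent: no change
ingest_run(catalog_a, tuesday)

# Catalog B: Monday's run is missed entirely, Tuesday's run succeeds.
catalog_b: Catalog = {}
ingest_run(catalog_b, tuesday)

# Both catalogs converge on the same state after the next successful run.
assert catalog_a == catalog_b
```

In this model the only cost of the missed run is that catalog B was stale until Tuesday.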
For sources that support it, stateful ingestion optimizes scheduled runs by tracking the last-seen state via checkpoints. This enables:
- Soft deletes -- Detecting entities that no longer exist in the source and marking them as removed
- Incremental extraction -- Only processing changes since the last successful run
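Soft-delete detection via a checkpoint can be sketched as a set difference between the entities recorded by the previous run and those seen in the current one. This is a conceptual illustration with hypothetical names, not DataHub's checkpoint format or storage.

```python
from typing import Set, Tuple


def plan_soft_deletes(
    previous_checkpoint: Set[str], seen_this_run: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (entities to mark as removed, checkpoint for the next run)."""
    # Anything recorded last time but absent now no longer exists in the source.
    stale = previous_checkpoint - seen_this_run
    # The new checkpoint should only be committed if the run succeeds;
    # otherwise a partial run could cause false soft deletes next time.
    next_checkpoint = set(seen_this_run)
    return stale, next_checkpoint


# Hypothetical entity identifiers.
stale, checkpoint = plan_soft_deletes(
    previous_checkpoint={"orders", "customers", "legacy_audit"},
    seen_this_run={"orders", "customers", "payments"},
)
assert stale == {"legacy_audit"}
```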
Scheduling Approaches
| Approach | Description | Best For |
|---|---|---|
| cron | Unix cron scheduler for simple periodic execution | Simple deployments, single-server setups |
| Airflow | Apache Airflow DAGs with BashOperator or PythonOperator | Complex workflows with dependencies and monitoring |
| CI/CD | GitHub Actions, GitLab CI, Jenkins scheduled pipelines | GitOps workflows, recipe-as-code |
| Kubernetes CronJob | Kubernetes-native scheduling for containerized execution | Cloud-native deployments |
| DataHub Server | Built-in scheduling via datahub ingest deploy with cron expressions | Centralized management, no external scheduler needed |
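For the plain cron approach, a crontab line can be generated from a schedule and a recipe path. The sketch below assumes hypothetical paths and a hypothetical helper name, and assumes `-c` is the recipe flag of `datahub ingest run`.

```python
import shlex


def crontab_entry(schedule: str, recipe_path: str, log_path: str) -> str:
    """Render one crontab line that runs a recipe and appends output to a log."""
    if len(schedule.split()) != 5:
        raise ValueError(f"expected a five-field cron expression, got: {schedule!r}")
    # shlex.join quotes any argument that contains shell-special characters.
    command = shlex.join(["datahub", "ingest", "run", "-c", recipe_path])
    return f"{schedule} {command} >> {shlex.quote(log_path)} 2>&1"


# Nightly at 02:00, matching the "nightly schema sync" scenario.
print(crontab_entry(
    "0 2 * * *",
    "/etc/datahub/recipes/warehouse.yml",
    "/var/log/datahub/warehouse.log",
))
```

Redirecting both output streams to a log keeps each run's report available for later inspection, since cron itself offers little monitoring.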
Constraints
- The recipe file and all referenced credentials must be accessible at execution time
- Environment variables used in recipes must be available in the scheduler's execution environment
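- Long-running ingestions may overlap with the next scheduled run; prevent this with a lock around the run or with the scheduler's own concurrency controls (for example, concurrencyPolicy: Forbid on a Kubernetes CronJob or max_active_runs=1 on an Airflow DAG)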
- Server-side scheduling (via datahub ingest deploy) requires a running DataHub executor
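Two of these constraints lend themselves to a pre-flight check in the wrapper script that a scheduler invokes. The sketch below is illustrative: the helper names are hypothetical, it assumes recipes reference environment variables in the plain `${NAME}` form, and it relies on `fcntl`, so it is Unix-only.

```python
import fcntl
import re
from contextlib import contextmanager
from typing import Iterator, List, Mapping

# Matches plain ${NAME} references only; forms with defaults are not covered.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(recipe_text: str, env: Mapping[str, str]) -> List[str]:
    """List variables the recipe references that the environment lacks."""
    names = sorted(set(_ENV_REF.findall(recipe_text)))
    return [name for name in names if name not in env]


@contextmanager
def single_run_lock(lock_path: str) -> Iterator[None]:
    """Hold an exclusive, non-blocking lock for the duration of one run.

    Raises BlockingIOError if a previous run still holds the lock, so an
    overlapping invocation fails fast instead of running concurrently.
    """
    handle = open(lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        handle.close()  # closing the file releases the lock
```

A wrapper would call `missing_env_vars(recipe_text, os.environ)` first and abort if the list is non-empty, then run the ingestion inside `with single_run_lock(...)`.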
Related Pages
- Implemented by: Datahub_project_Datahub_Scheduled_Ingestion_Orchestration
Implementation:Datahub_project_Datahub_Scheduled_Ingestion_Orchestration