Principle:Datahub project Datahub Connection Validation
| Property | Value |
|---|---|
| Page Type | Principle |
| Workflow | Metadata_Ingestion_Pipeline |
| Concept | Pre-flight validation of data source connectivity before ingestion execution |
| Repository | https://github.com/datahub-project/datahub |
| Implemented By | Implementation:Datahub_project_Datahub_Datahub_Ingest_Dry_Run |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
The Connection Validation principle ensures that the connectivity and configuration correctness of a data source can be verified before committing to a full ingestion run. In any metadata ingestion pipeline, failure to connect to a data source, whether due to incorrect credentials, network restrictions, or misconfigured parameters, is one of the most common causes of pipeline failure. Detecting these issues early, before the pipeline attempts to extract and load metadata, saves time and resources.
DataHub provides two complementary mechanisms for connection validation:
- Dry-run mode (`--dry-run`): Executes the full pipeline but skips writing records to the sink. This exercises the source connector's extraction logic without persisting any metadata, confirming that the source can be reached and that the recipe configuration produces valid work units.
- Test source connection (`--test-source-connection`): Invokes the source's dedicated connection test method via the `ConnectionManager`, producing a structured connection report without instantiating the full pipeline. This is the fastest way to validate credentials and network connectivity.
Both mechanisms embody the fail-fast design philosophy: surface configuration errors at the earliest possible moment, with clear diagnostic output, so that users can iterate on their recipes without waiting for a full ingestion cycle.
Usage
Connection Validation is applied in the following scenarios:
- Initial recipe authoring: A user runs `datahub ingest run -c recipe.yml --test-source-connection` to verify that the credentials and connection parameters in a new recipe are correct before attempting a full run.
- Dry-run previews: A user runs `datahub ingest run -c recipe.yml --dry-run --preview --preview-workunits 10` to extract a small sample of metadata without writing to the sink, confirming that the source produces expected work units.
- CI/CD pipeline gates: An automated pipeline runs the connection test as a pre-deployment check, preventing deployment of recipes with invalid configurations.
- Debugging production failures: After a failed ingestion run, an operator replays the recipe with `--dry-run` to isolate whether the issue is in the source extraction or the sink delivery.
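The CI/CD gate scenario can be sketched as a small Python wrapper around the CLI invocations quoted above. Only the flags come from the text; the helper names (`build_validation_command`, `run_gate`) and the injectable `runner` parameter are illustrative and not part of DataHub. The sketch also assumes that the `datahub` CLI signals failure through a non-zero exit status. That assumption should be verified for the installed version, particularly for the connection test, where inspecting the report file may be more reliable.

```python
import subprocess
from typing import Callable, List, Sequence


def build_validation_command(
    recipe_path: str, mode: str = "connection", preview_workunits: int = 10
) -> List[str]:
    """Build the argv for one of the two validation modes described above."""
    base = ["datahub", "ingest", "run", "-c", recipe_path]
    if mode == "connection":
        return base + ["--test-source-connection"]
    if mode == "dry-run":
        return base + [
            "--dry-run",
            "--preview",
            "--preview-workunits",
            str(preview_workunits),
        ]
    raise ValueError(f"unknown validation mode: {mode!r}")


def _default_runner(argv: Sequence[str]) -> int:
    # Assumption: a non-zero exit status means the validation step failed.
    return subprocess.run(list(argv), check=False).returncode


def run_gate(
    recipe_path: str,
    runner: Callable[[Sequence[str]], int] = _default_runner,
) -> bool:
    """Run the cheap connection test first, then a bounded dry run.

    Stops at the first failing step, so a bad credential is reported
    before any extraction is attempted.
    """
    for mode in ("connection", "dry-run"):
        if runner(build_validation_command(recipe_path, mode)) != 0:
            return False
    return True
```

Passing a custom `runner` keeps the gate testable without the CLI installed; in a real pipeline the default runner would be used and a `False` result would block deployment.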
Theoretical Basis
The Connection Validation principle is rooted in four well-established engineering ideas:
Fail-fast pattern. The fail-fast principle states that a system should detect and report errors as close as possible to the point of failure. In the context of metadata ingestion, a connection failure detected during a 30-second test is far more actionable than the same failure discovered 20 minutes into a full ingestion run. The --test-source-connection flag implements this pattern by invoking only the connection-related code path and returning a structured ConnectionReport.
Dry-run execution pattern. A dry run simulates the complete execution of an operation without committing its side effects. DataHub's --dry-run flag sets dry_run=True on the PipelineContext, which causes the pipeline to skip all sink writes while still executing source extraction, transformation, and work unit generation. This provides confidence that the pipeline's data flow is correct without modifying any external state.
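The dry-run pattern can be illustrated with a toy pipeline. This is a minimal sketch and not DataHub's implementation: `ToyContext`, `ToySink`, and `run_toy_pipeline` are hypothetical names, and the uppercase conversion merely stands in for a transformation step. What it shows is the shape of the idea: extraction and transformation always run, while the sink write is guarded by the `dry_run` flag.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ToyContext:
    """Hypothetical stand-in for a pipeline context carrying a dry_run flag."""

    dry_run: bool = False


@dataclass
class ToySink:
    """Hypothetical sink that records everything written to it."""

    written: List[str] = field(default_factory=list)

    def write(self, record: str) -> None:
        self.written.append(record)


def run_toy_pipeline(ctx: ToyContext, source: Iterable[str], sink: ToySink) -> int:
    """Extract and transform every record; write only when not in dry-run mode.

    Returns the number of records the source produced.
    """
    produced = 0
    for record in source:
        transformed = record.upper()  # stand-in for a transformation step
        produced += 1
        if not ctx.dry_run:
            sink.write(transformed)
    return produced
```

With `ToyContext(dry_run=True)` the function still reports how many records were produced, but `sink.written` stays empty, which mirrors the guarantee that no external state is modified.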
Preview with bounded scope. The --preview flag combined with --preview-workunits limits the number of work units extracted from the source, enabling rapid feedback on the structure and content of the metadata being produced. The pipeline uses itertools.islice to truncate the work unit iterator, ensuring that preview mode terminates quickly regardless of the source's total data volume.
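The truncation described above can be sketched with `itertools.islice` directly. The function and generator names here are illustrative, not DataHub's; the point is that `islice` is lazy, so even an unbounded source stops after the requested number of work units, and a stop value of `None` leaves the iterator untruncated.

```python
import itertools
from typing import Iterable, Iterator, Optional


def bounded_workunits(
    workunits: Iterable[str], preview: bool, preview_workunits: int = 10
) -> Iterator[str]:
    """Truncate the work unit stream in preview mode, pass it through otherwise."""
    limit: Optional[int] = preview_workunits if preview else None
    return itertools.islice(workunits, limit)


def endless_source() -> Iterator[str]:
    """Hypothetical source that never runs out of work units."""
    n = 0
    while True:
        yield f"workunit-{n}"
        n += 1


# Preview mode terminates even though the source is unbounded.
sample = list(bounded_workunits(endless_source(), preview=True, preview_workunits=3))
```

Because the source generator is only advanced as many times as the limit allows, the cost of a preview depends on the requested sample size rather than on the total volume of metadata in the source.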
Structured diagnostic output. Both the connection test and the dry run produce machine-readable reports. The connection report can be written to a file via the --report-to flag, and the pipeline summary includes source and sink reports with failure counts, warnings, and timing information. This structured output supports automated monitoring and alerting.
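An automated check could consume a report written via `--report-to` along the following lines. The JSON layout used here (a `basic_connectivity` object with `capable` and `failure_reason` fields) is an assumption made for illustration and should be checked against a report produced by the installed DataHub version; `check_connection_report` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def check_connection_report(path: str) -> Tuple[bool, Optional[str]]:
    """Return (ok, reason) from a connection report file.

    Assumed schema (not confirmed by the text above):
        {"basic_connectivity": {"capable": bool, "failure_reason": str}}
    A missing or malformed section is treated as a failure.
    """
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    basic = report.get("basic_connectivity") or {}
    capable = bool(basic.get("capable", False))
    if capable:
        return True, None
    return False, basic.get("failure_reason") or "no failure reason in report"
```

Treating an absent section as a failure keeps the check conservative: a gate built on it blocks deployment unless the report positively confirms connectivity.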