Principle:Datahub project Datahub Connection Validation

From Leeroopedia


Page Type: Principle
Workflow: Metadata_Ingestion_Pipeline
Concept: Pre-flight validation of data source connectivity before ingestion execution
Repository: https://github.com/datahub-project/datahub
Implemented By: Implementation:Datahub_project_Datahub_Datahub_Ingest_Dry_Run
Last Updated: 2026-02-09 17:00 GMT

Overview

Description

The Connection Validation principle ensures that the connectivity and configuration correctness of a data source can be verified before committing to a full ingestion run. In any metadata ingestion pipeline, failure to connect to a data source, whether due to incorrect credentials, network restrictions, or misconfigured parameters, is one of the most common causes of pipeline failure. Detecting these issues early, before the pipeline attempts to extract and load metadata, saves time and resources.

DataHub provides two complementary mechanisms for connection validation:

  1. Dry-run mode (--dry-run): Executes the full pipeline but skips writing records to the sink. This exercises the source connector's extraction logic without persisting any metadata, confirming that the source can be reached and that the recipe configuration produces valid work units.
  2. Test source connection (--test-source-connection): Invokes the source's dedicated connection test method via the ConnectionManager, producing a structured connection report without instantiating the full pipeline. This is the fastest way to validate credentials and network connectivity.

Both mechanisms embody the fail-fast design philosophy: surface configuration errors at the earliest possible moment, with clear diagnostic output, so that users can iterate on their recipes without waiting for a full ingestion cycle.

Usage

Connection Validation is applied in the following scenarios:

  • Initial recipe authoring: A user runs datahub ingest run -c recipe.yml --test-source-connection to verify that the credentials and connection parameters in a new recipe are correct before attempting a full run.
  • Dry-run previews: A user runs datahub ingest run -c recipe.yml --dry-run --preview --preview-workunits 10 to extract a small sample of metadata without writing to the sink, confirming that the source produces expected work units.
  • CI/CD pipeline gates: An automated pipeline runs the connection test as a pre-deployment check, preventing deployment of recipes with invalid configurations.
  • Debugging production failures: After a failed ingestion run, an operator replays the recipe with --dry-run to isolate whether the issue is in the source extraction or the sink delivery.
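The scenarios above all reduce to assembling one of two command lines. A minimal Python sketch of a helper that builds them is shown here. It uses only the flags quoted on this page. The helper names (build_validation_command, run_gate) are hypothetical and not part of DataHub, and the assumption that the CLI signals failure through a nonzero exit code should be confirmed against the installed version.

```python
import subprocess
from typing import List, Optional


def build_validation_command(
    recipe_path: str,
    mode: str,
    preview_workunits: Optional[int] = None,
) -> List[str]:
    """Build the argv for a pre-flight check (hypothetical helper, not a DataHub API)."""
    base = ["datahub", "ingest", "run", "-c", recipe_path]
    if mode == "test-connection":
        # Fast path: only the source's connection test is exercised.
        return base + ["--test-source-connection"]
    if mode == "dry-run":
        cmd = base + ["--dry-run"]
        if preview_workunits is not None:
            # Bound the run to a small sample of work units.
            cmd += ["--preview", "--preview-workunits", str(preview_workunits)]
        return cmd
    raise ValueError(f"unknown mode: {mode!r}")


def run_gate(cmd: List[str]) -> bool:
    """Run the command and report success.

    Assumption: a nonzero exit code means the check failed.
    """
    return subprocess.run(cmd).returncode == 0
```

A CI job could call run_gate(build_validation_command("recipe.yml", "test-connection")) and stop the deployment when it returns False.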

Theoretical Basis

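The Connection Validation principle is rooted in two well-established engineering patterns, fail-fast and dry-run execution, supported by two complementary practices: bounded previews and structured diagnostic output.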

Fail-fast pattern. The fail-fast principle states that a system should detect and report errors as close as possible to the point of failure. In the context of metadata ingestion, a connection failure detected during a 30-second test is far more actionable than the same failure discovered 20 minutes into a full ingestion run. The --test-source-connection flag implements this pattern by invoking only the connection-related code path and returning a structured ConnectionReport.

Dry-run execution pattern. A dry run simulates the complete execution of an operation without committing its side effects. DataHub's --dry-run flag sets dry_run=True on the PipelineContext, which causes the pipeline to skip all sink writes while still executing source extraction, transformation, and work unit generation. This provides confidence that the pipeline's data flow is correct without modifying any external state.
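The dry-run pattern can be illustrated with a toy pipeline. This is a simplified sketch, not DataHub's implementation: ToyContext, ToySink, toy_source, and run_pipeline are hypothetical stand-ins. It shows only the general idea of a context flag that lets extraction and transformation run while the sink write is skipped.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class ToyContext:
    # Stand-in for a pipeline context carrying a dry_run flag.
    dry_run: bool = False


@dataclass
class ToySink:
    written: List[str] = field(default_factory=list)

    def write(self, record: str) -> None:
        self.written.append(record)


def toy_source() -> Iterator[str]:
    # Pretend extraction: yields one work unit per table.
    for name in ("db.orders", "db.customers", "db.payments"):
        yield f"workunit:{name}"


def run_pipeline(ctx: ToyContext, source: Iterable[str], sink: ToySink) -> int:
    """Run extraction and transformation; write to the sink unless dry_run is set."""
    produced = 0
    for workunit in source:
        transformed = workunit.upper()  # placeholder transformation step
        produced += 1
        if not ctx.dry_run:
            sink.write(transformed)
    return produced


dry_sink = ToySink()
dry_count = run_pipeline(ToyContext(dry_run=True), toy_source(), dry_sink)

real_sink = ToySink()
real_count = run_pipeline(ToyContext(dry_run=False), toy_source(), real_sink)
```

Both runs produce the same number of work units, but only the second one leaves records in the sink, so the dry run leaves external state untouched.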

Preview with bounded scope. The --preview flag combined with --preview-workunits limits the number of work units extracted from the source, enabling rapid feedback on the structure and content of the metadata being produced. The pipeline uses itertools.islice to truncate the work unit iterator, ensuring that preview mode terminates quickly regardless of the source's total data volume.
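The truncation technique can be shown in isolation. The sketch below applies itertools.islice to a hypothetical, unbounded work unit generator. The names endless_workunits and preview are illustrative only and do not come from the DataHub codebase.

```python
import itertools
from typing import Iterable, Iterator, List


def endless_workunits() -> Iterator[str]:
    # Hypothetical source that never runs out of work units.
    n = 0
    while True:
        yield f"workunit-{n}"
        n += 1


def preview(workunits: Iterable[str], limit: int) -> List[str]:
    """Take at most `limit` items; islice stops pulling from the iterator after that."""
    return list(itertools.islice(workunits, limit))


sample = preview(endless_workunits(), 10)
```

Because islice is lazy, the generator is never advanced past the tenth item, so the preview finishes even though the source is unbounded.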

Structured diagnostic output. Both the connection test and the dry run produce machine-readable reports. The connection report can be written to a file via the --report-to flag, and the pipeline summary includes source and sink reports with failure counts, warnings, and timing information. This structured output supports automated monitoring and alerting.
