Principle:Datahub project Datahub Lineage Verification
| Attribute | Value |
|---|---|
| Page Type | Principle |
| Workflow | Spark_Lineage_Capture |
| Pair | 6 of 6 |
| Implementation | Implementation:Datahub_project_Datahub_DataHub_UI_Lineage_Verification |
| Repository | https://github.com/datahub-project/datahub |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
Lineage Verification is the principle of validating that captured lineage relationships are correctly represented in the metadata platform through both visual inspection and programmatic testing. After the Spark agent captures and emits lineage metadata, verification ensures that the resulting lineage graph in DataHub accurately reflects the actual data flow: input datasets are connected to the correct DataJob, the DataJob is connected to output datasets, and the relationships are navigable through the DataHub UI and API.
Verification operates at two levels: visual verification through the DataHub web interface, where users navigate to dataset or pipeline pages and inspect the Lineage tab to confirm that upstream and downstream relationships appear correctly; and programmatic verification through automated smoke tests and API calls that assert the existence and correctness of lineage edges.
Usage
Lineage Verification is applied as the final step in the Spark lineage capture workflow, after the agent has emitted metadata to DataHub. It serves several purposes:
- Development validation: When developing or modifying the Spark agent, verification confirms that changes produce correct lineage output. Developers run a Spark job, then check the DataHub UI or query the GMS API to inspect results.
- Smoke testing: The DataHub project includes end-to-end smoke tests (in `smoke-test/test_e2e.py`) that exercise the full ingestion pipeline and verify that entities and relationships exist in the metadata store.
- Production monitoring: In production environments, teams can query the DataHub API to programmatically verify that expected lineage relationships exist after scheduled Spark jobs complete.
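Every one of these checks starts from the URNs the Spark agent is expected to emit. The sketch below shows the standard DataHub URN shapes for datasets and DataJobs; the platform, table, and application names are illustrative placeholders, not values from this workflow:

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN, e.g. for a Hive table a Spark job reads."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def datajob_urn(platform: str, flow_id: str, job_id: str, cluster: str = "PROD") -> str:
    """Build a DataHub DataJob URN, nested inside its parent DataFlow URN."""
    return f"urn:li:dataJob:(urn:li:dataFlow:({platform},{flow_id},{cluster}),{job_id})"


# Illustrative names only; real values come from the Spark app and its tables.
upstream = dataset_urn("hive", "db.input_table")
job = datajob_urn("spark", "my_spark_app", "QueryExecId_1")
print(upstream)
print(job)
```

Verification code compares URNs like these against what the UI or API actually returns, so getting the expected URN string exactly right is the first step.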
Verification methods include:
- DataHub UI: Navigate to `http://localhost:9002` (or the configured DataHub frontend URL), search for a dataset by name or URN, open the Lineage tab, and visually inspect the upstream/downstream graph.
- GMS REST API: Query the `/entities` or `/entitiesV2` endpoints to retrieve entity aspects and verify that lineage-related aspects (`upstreamLineage`, `dataJobInputOutput`) contain the expected dataset URNs.
- GraphQL API: Execute GraphQL queries against DataHub to traverse lineage relationships programmatically, including pagination through large lineage graphs.
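For the REST route, a small helper can assert that a `dataJobInputOutput` aspect carries the expected dataset URNs. The field names below follow the DataHub aspect model, but the HTTP response envelope differs between `/entities` and `/entitiesV2`, so the sample payload is an illustrative aspect value that a caller would first unwrap from the real response:

```python
def lineage_urns(aspect_value: dict) -> tuple[set, set]:
    """Extract the input/output dataset URNs from a dataJobInputOutput
    aspect value (both fields are lists of dataset URN strings)."""
    return (set(aspect_value.get("inputDatasets", [])),
            set(aspect_value.get("outputDatasets", [])))


# Illustrative aspect value, as it might look after unwrapping the
# GMS response envelope; URNs here are placeholders.
sample = {
    "inputDatasets": ["urn:li:dataset:(urn:li:dataPlatform:hive,db.in,PROD)"],
    "outputDatasets": ["urn:li:dataset:(urn:li:dataPlatform:hive,db.out,PROD)"],
}
inputs, outputs = lineage_urns(sample)
print(sorted(inputs), sorted(outputs))
```

A smoke test would then assert that the sets returned by `lineage_urns` equal the URNs of the tables the Spark job actually read and wrote.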
Theoretical Basis
The Lineage Verification principle is grounded in the following theoretical foundations:
End-to-end testing of metadata pipelines: Metadata pipelines are inherently complex, involving multiple transformation stages (Spark plan analysis, OpenLineage event generation, DataHub conversion, MCP emission, GMS ingestion, graph indexing). An error at any stage can produce silently incorrect lineage. End-to-end verification is the only reliable way to confirm correctness across the entire pipeline.
Verification pattern for data lineage graphs: Lineage graphs have properties that can be systematically verified:
- Structural correctness: Every input dataset should have an edge to the DataJob, and every output dataset should have an edge from the DataJob.
- Referential integrity: All URNs referenced in lineage edges should resolve to existing entities in the metadata store.
- Completeness: The set of datasets in the lineage graph should match the set of tables actually read from and written to by the Spark job.
- Idempotency: Running the same Spark job multiple times should produce the same lineage graph (assuming coalescing is enabled).
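The first three properties can be checked mechanically once the lineage edges have been fetched. A minimal, self-contained sketch, representing the graph as a set of `(source_urn, destination_urn)` edges (the URNs and graph shape are illustrative; a real check would populate them from the REST or GraphQL API):

```python
def verify_lineage(job_urn, edges, known_urns, expected_inputs, expected_outputs):
    """Check structural correctness, referential integrity, and completeness
    of a lineage graph given as (source, destination) URN edges.
    Returns a list of human-readable problems; an empty list means it passes."""
    inputs = {src for src, dst in edges if dst == job_urn}   # edges into the DataJob
    outputs = {dst for src, dst in edges if src == job_urn}  # edges out of the DataJob
    problems = []
    # Structural correctness + completeness: edge sets match expectations exactly.
    if inputs != expected_inputs:
        problems.append(f"input mismatch: {sorted(inputs ^ expected_inputs)}")
    if outputs != expected_outputs:
        problems.append(f"output mismatch: {sorted(outputs ^ expected_outputs)}")
    # Referential integrity: every URN on an edge resolves to a known entity.
    for urn in inputs | outputs | {job_urn}:
        if urn not in known_urns:
            problems.append(f"dangling URN: {urn}")
    return problems


# Placeholder URNs for a job reading db.a and writing db.b:
job = "urn:li:dataJob:(urn:li:dataFlow:(spark,app,PROD),job1)"
a = "urn:li:dataset:(urn:li:dataPlatform:hive,db.a,PROD)"
b = "urn:li:dataset:(urn:li:dataPlatform:hive,db.b,PROD)"
print(verify_lineage(job, {(a, job), (job, b)}, {job, a, b}, {a}, {b}))
```

Idempotency is checked differently: run the job twice and assert that the edge set is unchanged after the second run.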
Two-level verification: Visual verification through the UI catches issues that automated tests might miss (e.g., rendering problems, unexpected URN formatting that produces unlinked nodes), while programmatic verification catches issues that visual inspection might miss (e.g., missing column-level lineage, incorrect custom properties, subtle URN mismatches).
Feedback loop: Verification closes the feedback loop in the metadata capture workflow. Without verification, issues in agent configuration, network connectivity, or conversion logic may go undetected, leading to incomplete or incorrect lineage graphs that undermine trust in the metadata platform.