Implementation: DataHub UI Lineage Verification (datahub-project/datahub)
| Attribute | Value |
|---|---|
| Page Type | Implementation (External Tool Doc) |
| Workflow | Spark_Lineage_Capture |
| Pair | 6 of 6 |
| Principle | Principle:Datahub_project_Datahub_Lineage_Verification |
| Repository | https://github.com/datahub-project/datahub |
| Source Location | smoke-test/test_e2e.py:L1-1218 |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
DataHub UI Lineage Verification encompasses the tools and techniques for validating that Spark lineage metadata has been correctly captured and ingested into the DataHub metadata platform. Verification operates through two complementary channels: the DataHub web interface for visual inspection of lineage graphs, and programmatic testing via the GMS REST API and GraphQL API for automated validation.
The DataHub frontend (React application served at http://localhost:9002 by default) provides a Lineage tab on every dataset and pipeline entity page, rendering a visual graph of upstream and downstream relationships. The smoke test suite (smoke-test/test_e2e.py) provides automated verification patterns that can be adapted for Spark lineage validation, including entity existence checks, aspect retrieval, and relationship traversal.
Usage
Lineage verification is performed after a Spark job has completed and the DataHub Spark agent has emitted its metadata. The verification process confirms that:
- The DataFlow entity (representing the Spark application) exists in DataHub.
- The DataJob entity (representing the coalesced task) exists with correct input/output edges.
- Input dataset entities are linked as upstream of the DataJob.
- Output dataset entities are linked as downstream of the DataJob.
- Column-level lineage (if captured) appears in the dataset's UpstreamLineage aspect.
- Custom properties, tags, and domains are correctly attached.
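The checklist above can be composed into one reusable routine. The sketch below is illustrative and not part of the smoke-test suite: `exists` and `upstreams_of` are hypothetical callables that you would back with the REST or GraphQL calls shown later on this page.

```python
# Sketch of the verification checklist as a function. `exists(urn)` should
# return True if the entity is in DataHub; `upstreams_of(urn)` should return
# the list of upstream entity URNs. Both are injected, so the checklist
# itself is independent of which API (REST or GraphQL) performs the lookups.

def verify_spark_lineage(exists, upstreams_of, flow_urn, job_urn,
                         input_urns, output_urns):
    """Raise AssertionError if an expected entity or lineage edge is missing."""
    assert exists(flow_urn), f"DataFlow missing: {flow_urn}"
    assert exists(job_urn), f"DataJob missing: {job_urn}"
    for urn in input_urns + output_urns:
        assert exists(urn), f"Dataset missing: {urn}"
    # The DataJob's upstreams must include every input dataset...
    assert set(input_urns) <= set(upstreams_of(job_urn)), \
        "DataJob is missing input edges"
    # ...and every output dataset must have some upstream lineage recorded.
    for out in output_urns:
        assert upstreams_of(out), f"No upstream lineage on {out}"
```

The injected-callable shape keeps the checklist testable offline: in a unit test the callables can be stubs, while in a smoke test they wrap live API calls.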
Code Reference
Source Location
| Attribute | Value |
|---|---|
| File | smoke-test/test_e2e.py |
| Lines | L1-1218 |
| Module | smoke-test |
| Additional | DataHub UI at http://localhost:9002 |
Signature
Visual verification (DataHub UI):
1. Navigate to http://localhost:9002 (or configured frontend URL)
2. Search for dataset by name or URN
3. Open the dataset page
4. Click the "Lineage" tab
5. Inspect the upstream/downstream graph for:
- Input datasets connected to the DataJob
- Output datasets connected from the DataJob
- Column-level lineage edges (if applicable)
Programmatic verification (GMS REST API):
# Check entity existence (assumes `session` is an authenticated
# requests.Session and `gms_url` is the GMS base URL, e.g. http://localhost:8080)
import urllib.parse

response = session.get(
    f"{gms_url}/entitiesV2?ids=List({urllib.parse.quote(urn)})"
    f"&aspects=List(datasetProperties)",
    headers={
        "X-RestLi-Protocol-Version": "2.0.0",
        "X-RestLi-Method": "batch_get",
    },
)
response.raise_for_status()
data = response.json()
assert urn in data["results"]
Programmatic verification (GraphQL API):
# Query lineage via GraphQL
query = """
query lineage($urn: String!) {
  dataset(urn: $urn) {
    urn
    properties {
      name
    }
    lineage(input: { direction: UPSTREAM, start: 0, count: 10 }) {
      total
      relationships {
        entity {
          urn
          type
        }
      }
    }
  }
}
"""
variables = {"urn": dataset_urn}
result = execute_graphql(session, query, variables)
assert result["data"]["dataset"]["lineage"]["total"] > 0
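Beyond asserting a nonzero total, a thin helper can check that specific upstream URNs appear in the response. The sketch below only assumes the response shape produced by the GraphQL lineage query above; the sample response is fabricated for illustration.

```python
# Extract the set of upstream entity URNs from a GraphQL lineage response
# shaped like the `lineage(input: { direction: UPSTREAM, ... })` query above.

def upstream_urns(lineage_response):
    lineage = lineage_response["data"]["dataset"]["lineage"]
    return {rel["entity"]["urn"] for rel in lineage["relationships"]}

# Usage against a captured (illustrative) response:
resp = {"data": {"dataset": {"lineage": {"total": 1, "relationships": [
    {"entity": {
        "urn": "urn:li:dataJob:(urn:li:dataFlow:(spark,my-app,default),my-app)",
        "type": "DATA_JOB",
    }},
]}}}}
job = "urn:li:dataJob:(urn:li:dataFlow:(spark,my-app,default),my-app)"
assert job in upstream_urns(resp)
```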
Import
# For programmatic verification in smoke tests
from tests.utils import (
    execute_graphql,
    get_admin_credentials,
    get_frontend_session,
    ingest_file_via_rest,
    wait_for_writes_to_sync,
)
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | Dataset URN (String) | The URN of a dataset to verify, e.g., urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD). |
| Input | DataFlow URN (String) | The URN of the Spark pipeline to verify, e.g., urn:li:dataFlow:(spark,my-app,default). |
| Input | DataJob URN (String) | The URN of the Spark task to verify, e.g., urn:li:dataJob:(urn:li:dataFlow:(spark,my-app,default),my-app). |
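The URN shapes in the contract follow a fixed grammar, so small helpers keep test code consistent. This is an illustrative sketch mirroring the example URNs above, not a DataHub SDK API (the SDK ships its own builders):

```python
# URN builders matching the formats listed in the I/O contract.

def dataset_urn(platform, name, env="PROD"):
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def dataflow_urn(orchestrator, flow_id, cluster="default"):
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"

def datajob_urn(flow_urn, job_id):
    # Note: a DataJob URN nests the full DataFlow URN.
    return f"urn:li:dataJob:({flow_urn},{job_id})"

flow = dataflow_urn("spark", "my-app")
job = datajob_urn(flow, "my-app")
```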
| Output (Visual) | Lineage Graph | A visual directed graph in the DataHub UI showing input datasets, the DataJob node, and output datasets with connecting edges. |
| Output (Programmatic) | JSON Response | API responses containing entity aspects, relationship counts, and lineage edges that can be asserted in automated tests. |
Usage Examples
Example 1: Visual verification workflow
1. Run Spark job with DataHub agent configured:
spark-submit --packages io.acryl:acryl-spark-lineage_2.12:0.2.18 \
--conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
--conf "spark.datahub.rest.server=http://localhost:8080" \
etl_job.py
2. Wait for job completion and metadata ingestion.
3. Open browser to http://localhost:9002
4. Search for "output_table" in the DataHub search bar.
5. Click on the dataset entity.
6. Navigate to the "Lineage" tab.
7. Verify:
- The DataJob "etl_job" appears as an upstream node.
- The input datasets (e.g., "input_table_a", "input_table_b") appear
as upstream of the DataJob.
- The output dataset ("output_table") appears as downstream of the DataJob.
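Ingestion in step 2 is asynchronous, so waiting is often better done with a bounded poll than a fixed sleep (the smoke tests use wait_for_writes_to_sync for this; the generic sketch below is a stand-in, and the `check` callable is a hypothetical existence probe such as the REST call in Example 2):

```python
import time

def wait_until(check, timeout=60.0, interval=2.0):
    """Poll `check()` until it returns True or `timeout` seconds elapse.
    Returns True on success, False if the deadline passes first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

Usage: `wait_until(lambda: entity_exists(dataset_urn), timeout=120)` before opening the UI, where `entity_exists` is whatever probe you wire up.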
Example 2: Programmatic entity existence check
import urllib.parse
import requests
gms_url = "http://localhost:8080"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,mydb.output_table,PROD)"
response = requests.get(
    f"{gms_url}/entitiesV2"
    f"?ids=List({urllib.parse.quote(dataset_urn)})"
    f"&aspects=List(upstreamLineage)",
    headers={
        "X-RestLi-Protocol-Version": "2.0.0",
        "X-RestLi-Method": "batch_get",
    },
)
response.raise_for_status()
data = response.json()
# Verify entity exists
assert dataset_urn in data["results"]
# Verify upstream lineage aspect is present
assert "upstreamLineage" in data["results"][dataset_urn]["aspects"]
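Once the upstreamLineage aspect is present, its contents can be inspected too. The helper below assumes an already-unwrapped aspect value with the `upstreams` field of the UpstreamLineage model (the exact response envelope around the aspect varies by endpoint, so unwrap it first); the sample payload is illustrative.

```python
# Extract upstream dataset URNs from an upstreamLineage aspect value.
# Column-level edges, when the agent captures them, appear alongside
# `upstreams` in a `fineGrainedLineages` field of the same aspect.

def lineage_upstreams(aspect_value):
    return [edge["dataset"] for edge in aspect_value.get("upstreams", [])]

# Illustrative aspect value:
aspect = {"upstreams": [
    {"dataset": "urn:li:dataset:(urn:li:dataPlatform:hive,mydb.input_table_a,PROD)",
     "type": "TRANSFORMED"},
]}
assert lineage_upstreams(aspect) == [
    "urn:li:dataset:(urn:li:dataPlatform:hive,mydb.input_table_a,PROD)"
]
```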
Example 3: Smoke test pattern for DataFlow verification
from tests.utils import execute_graphql, get_frontend_session
session = get_frontend_session()
flow_urn = "urn:li:dataFlow:(spark,my-etl-app,default)"
query = """
query dataFlow($urn: String!) {
  dataFlow(urn: $urn) {
    urn
    properties {
      name
    }
    childJobs: relationships(
      input: { types: ["IsPartOf"], direction: INCOMING, start: 0, count: 10 }
    ) {
      total
      relationships {
        entity {
          urn
        }
      }
    }
  }
}
"""
result = execute_graphql(session, query, {"urn": flow_urn})
assert result["data"]["dataFlow"] is not None
assert result["data"]["dataFlow"]["properties"]["name"] == "my-etl-app"
assert result["data"]["dataFlow"]["childJobs"]["total"] >= 1