
Implementation:Datahub project Datahub DataHub UI Lineage Verification

From Leeroopedia


Attribute Value
Page Type Implementation (External Tool Doc)
Workflow Spark_Lineage_Capture
Pair 6 of 6
Principle Principle:Datahub_project_Datahub_Lineage_Verification
Repository https://github.com/datahub-project/datahub
Source Location smoke-test/test_e2e.py:L1-1218
Last Updated 2026-02-09 17:00 GMT

Overview

Description

DataHub UI Lineage Verification encompasses the tools and techniques for validating that Spark lineage metadata has been correctly captured and ingested into the DataHub metadata platform. Verification operates through two complementary channels: the DataHub web interface for visual inspection of lineage graphs, and programmatic testing via the GMS REST API and GraphQL API for automated validation.

The DataHub frontend (React application served at http://localhost:9002 by default) provides a Lineage tab on every dataset and pipeline entity page, rendering a visual graph of upstream and downstream relationships. The smoke test suite (smoke-test/test_e2e.py) provides automated verification patterns that can be adapted for Spark lineage validation, including entity existence checks, aspect retrieval, and relationship traversal.

Usage

Lineage verification is performed after a Spark job has completed and the DataHub Spark agent has emitted its metadata. The verification process confirms that:

  • The DataFlow entity (representing the Spark application) exists in DataHub.
  • The DataJob entity (representing the coalesced task) exists with correct input/output edges.
  • Input dataset entities are linked as upstream of the DataJob.
  • Output dataset entities are linked as downstream of the DataJob.
  • Column-level lineage (if captured) appears in the dataset's UpstreamLineage aspect.
  • Custom properties, tags, and domains are correctly attached.
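
The checklist above can be scripted by first deriving the URNs to look up. A minimal sketch, assuming the URN formats shown in the I/O Contract section; the platform, environment, and cluster defaults here are illustrative, not DataHub settings:

```python
# Build the URNs the verification checklist expects for one Spark run.
# platform/env/cluster defaults are assumptions -- adapt to your deployment.
def spark_lineage_urns(app_name, inputs, outputs,
                       platform="hive", env="PROD", cluster="default"):
    def dataset_urn(name):
        return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

    flow_urn = f"urn:li:dataFlow:(spark,{app_name},{cluster})"
    return {
        "dataFlow": flow_urn,
        # The DataJob URN nests the parent DataFlow URN
        "dataJob": f"urn:li:dataJob:({flow_urn},{app_name})",
        "inputs": [dataset_urn(n) for n in inputs],
        "outputs": [dataset_urn(n) for n in outputs],
    }
```

Each returned URN can then be fed to the existence and lineage checks shown in the Code Reference below.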

Code Reference

Source Location

File smoke-test/test_e2e.py
Lines L1-1218
Module smoke-test
Additional DataHub UI at http://localhost:9002

Signature

Visual verification (DataHub UI):

1. Navigate to http://localhost:9002 (or configured frontend URL)
2. Search for dataset by name or URN
3. Open the dataset page
4. Click the "Lineage" tab
5. Inspect the upstream/downstream graph for:
   - Input datasets connected to the DataJob
   - Output datasets connected from the DataJob
   - Column-level lineage edges (if applicable)
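
The click-path above can be shortened with a deep link straight to the entity page. A sketch, assuming the frontend's `/dataset/<urn>` route (used by recent DataHub UIs, but verify against your deployment) and the default frontend URL:

```python
import urllib.parse

# Build a deep link to a dataset's page in the DataHub UI.
# The /dataset/<urn> route is an assumption -- confirm for your version.
def dataset_page_url(urn, frontend_url="http://localhost:9002"):
    # URNs contain ':', ',', and parentheses, so percent-encode everything
    return f"{frontend_url}/dataset/{urllib.parse.quote(urn, safe='')}"
```

From the resulting page, the "Lineage" tab is one click away.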

Programmatic verification (GMS REST API):

# Check entity existence
# (session: an authenticated requests.Session; gms_url: e.g. "http://localhost:8080")
import urllib.parse

response = session.get(
    f"{gms_url}/entitiesV2?ids=List({urllib.parse.quote(urn)})"
    f"&aspects=List(datasetProperties)",
    headers={
        "X-RestLi-Protocol-Version": "2.0.0",
        "X-RestLi-Method": "batch_get",
    },
)
response.raise_for_status()
data = response.json()
assert urn in data["results"]

Programmatic verification (GraphQL API):

# Query lineage via GraphQL
query = """
query lineage($urn: String!) {
    dataset(urn: $urn) {
        urn
        properties {
            name
        }
        lineage(input: { direction: UPSTREAM, start: 0, count: 10 }) {
            total
            relationships {
                entity {
                    urn
                    type
                }
            }
        }
    }
}
"""
variables = {"urn": dataset_urn}
result = execute_graphql(session, query, variables)
assert result["data"]["dataset"]["lineage"]["total"] > 0

Import

# For programmatic verification in smoke tests
from tests.utils import (
    execute_graphql,
    get_admin_credentials,
    get_frontend_session,
    ingest_file_via_rest,
    wait_for_writes_to_sync,
)
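
For use outside the smoke-test tree, where `tests.utils` is unavailable, a hedged stand-in for `execute_graphql` can be sketched. It assumes the frontend exposes GraphQL at `/api/v2/graphql` (confirm against your DataHub version) and that `session` is an already-authenticated `requests.Session`, e.g. from `get_frontend_session()`:

```python
# Standard GraphQL-over-HTTP request body
def graphql_payload(query, variables=None):
    return {"query": query, "variables": variables or {}}

# Sketch of an execute_graphql equivalent; the /api/v2/graphql path and
# default frontend URL are assumptions about the deployment.
def execute_graphql_sketch(session, query, variables=None,
                           frontend_url="http://localhost:9002"):
    response = session.post(
        f"{frontend_url}/api/v2/graphql",
        json=graphql_payload(query, variables),
    )
    response.raise_for_status()
    return response.json()
```

The returned dict has the usual GraphQL shape, with results under `"data"` and failures under `"errors"`.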

I/O Contract

Direction Type Description
Input Dataset URN (String) The URN of a dataset to verify, e.g., urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD).
Input DataFlow URN (String) The URN of the Spark pipeline to verify, e.g., urn:li:dataFlow:(spark,my-app,default).
Input DataJob URN (String) The URN of the Spark task to verify, e.g., urn:li:dataJob:(urn:li:dataFlow:(spark,my-app,default),my-app).
Output (Visual) Lineage Graph A visual directed graph in the DataHub UI showing input datasets, the DataJob node, and output datasets with connecting edges.
Output (Programmatic) JSON Response API responses containing entity aspects, relationship counts, and lineage edges that can be asserted in automated tests.
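
When writing assertions against these inputs, it helps to split a dataset URN back into its parts. A small sketch that assumes the simple tuple form shown in the contract above (and that the table name itself contains no commas); nested or escaped URNs would need the server-side URN parser:

```python
# Split a simple dataset URN into platform, name, and environment.
# Assumes the flat form from the I/O Contract; raises on anything else.
def parse_dataset_urn(urn):
    prefix = "urn:li:dataset:(urn:li:dataPlatform:"
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"not a simple dataset URN: {urn}")
    parts = urn[len(prefix):-1].split(",")
    if len(parts) != 3:
        raise ValueError(f"unexpected URN shape: {urn}")
    platform, name, env = parts
    return {"platform": platform, "name": name, "env": env}
```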

Usage Examples

Example 1: Visual verification workflow

1. Run Spark job with DataHub agent configured:
   spark-submit --packages io.acryl:acryl-spark-lineage_2.12:0.2.18 \
     --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
     --conf "spark.datahub.rest.server=http://localhost:8080" \
     etl_job.py

2. Wait for job completion and metadata ingestion.

3. Open browser to http://localhost:9002

4. Search for "output_table" in the DataHub search bar.

5. Click on the dataset entity.

6. Navigate to the "Lineage" tab.

7. Verify:
   - The DataJob "etl_job" appears as an upstream node.
   - The input datasets (e.g., "input_table_a", "input_table_b") appear
     as upstream of the DataJob.
   - The output dataset ("output_table") appears as downstream of the DataJob.
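
Step 2 ("wait for job completion and metadata ingestion") can be automated rather than eyeballed: poll until a check passes or a deadline expires. A generic sketch, where `check` is any callable returning True once the entity is visible (for example, the entitiesV2 lookup from Example 2 below); the timeout and interval values are illustrative defaults, not DataHub settings:

```python
import time

# Poll check() until it returns True or the timeout elapses.
# Returns True on success, False if the deadline passed first.
def wait_until(check, timeout=60.0, interval=2.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

The smoke tests use `wait_for_writes_to_sync` for the same purpose inside the test environment.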

Example 2: Programmatic entity existence check

import urllib.parse
import requests

gms_url = "http://localhost:8080"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,mydb.output_table,PROD)"

response = requests.get(
    f"{gms_url}/entitiesV2"
    f"?ids=List({urllib.parse.quote(dataset_urn)})"
    f"&aspects=List(upstreamLineage)",
    headers={
        "X-RestLi-Protocol-Version": "2.0.0",
        "X-RestLi-Method": "batch_get",
    },
)
response.raise_for_status()
data = response.json()

# Verify entity exists
assert dataset_urn in data["results"]

# Verify upstream lineage aspect is present
assert "upstreamLineage" in data["results"][dataset_urn]["aspects"]
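
Going one step further than the existence assertions above, the upstream dataset URNs can be pulled out of the aspect payload and compared against the expected inputs. The nesting used here (aspects → upstreamLineage → value → upstreams → dataset) matches the entitiesV2 response shape at the time of writing, but should be verified against your GMS version:

```python
# Extract upstream dataset URNs from one entity's entitiesV2 result.
# The response nesting is an assumption -- confirm against your GMS.
def upstream_dataset_urns(entity_result):
    aspect = entity_result["aspects"]["upstreamLineage"]["value"]
    return [edge["dataset"] for edge in aspect.get("upstreams", [])]
```

The returned list can then be asserted to contain the URNs of the Spark job's input tables.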

Example 3: Smoke test pattern for DataFlow verification

from tests.utils import execute_graphql, get_frontend_session

session = get_frontend_session()
flow_urn = "urn:li:dataFlow:(spark,my-etl-app,default)"

query = """
query dataFlow($urn: String!) {
    dataFlow(urn: $urn) {
        urn
        properties {
            name
        }
        childJobs: relationships(
            input: { types: ["IsPartOf"], direction: INCOMING, start: 0, count: 10 }
        ) {
            total
            relationships {
                entity {
                    urn
                }
            }
        }
    }
}
"""

result = execute_graphql(session, query, {"urn": flow_urn})
assert result["data"]["dataFlow"] is not None
assert result["data"]["dataFlow"]["properties"]["name"] == "my-etl-app"
assert result["data"]["dataFlow"]["childJobs"]["total"] >= 1
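
After confirming the DataFlow and its child jobs, the DataJob's own input/output edges can be checked in the same style. The field names below (`inputOutput`, `inputDatasets`, `outputDatasets`) follow DataHub's GraphQL schema but should be confirmed via your server's GraphiQL or introspection before relying on them:

```python
# GraphQL query for a DataJob's dataset edges; field names are assumptions
# to verify against the server's schema.
DATAJOB_IO_QUERY = """
query dataJob($urn: String!) {
    dataJob(urn: $urn) {
        inputOutput {
            inputDatasets { urn }
            outputDatasets { urn }
        }
    }
}
"""

# Assert that the job's edges include the expected input/output dataset URNs.
def assert_job_edges(result, expected_inputs, expected_outputs):
    io = result["data"]["dataJob"]["inputOutput"]
    assert {d["urn"] for d in io["inputDatasets"]} >= set(expected_inputs)
    assert {d["urn"] for d in io["outputDatasets"]} >= set(expected_outputs)
```

Used together with the DataFlow query above, this covers every edge in the checklist from the Usage section.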
