Principle:Datahub project Datahub OpenLineage Conversion
| Attribute | Value |
|---|---|
| Page Type | Principle |
| Workflow | Spark_Lineage_Capture |
| Pair | 4 of 6 |
| Implementation | Implementation:Datahub_project_Datahub_OpenLineageToDataHub_ConvertRunEvent |
| Repository | https://github.com/datahub-project/datahub |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
OpenLineage Conversion is the principle of translating standardized, vendor-neutral lineage events into a platform-specific metadata model. In DataHub's Spark integration, the OpenLineage standard provides a canonical representation of data lineage -- with concepts like RunEvents, Jobs, InputDatasets, and OutputDatasets -- that must be mapped to DataHub's entity model of DataFlows, DataJobs, Datasets, and their associated aspects (ownership, schema, lineage edges, tags, domains).
This conversion layer serves as an adapter between two metadata schemas. The OpenLineage side represents lineage as it is captured by the Spark execution engine, while the DataHub side represents lineage as it is stored, queried, and visualized in the metadata platform. The conversion handles namespace-to-platform mapping, dataset name normalization, URN construction, fine-grained column-level lineage translation, and the extraction of custom properties from OpenLineage facets.
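The namespace-to-platform mapping mentioned above can be sketched in a few lines. This is an illustrative helper, not the actual converter code; the class and method names are hypothetical:

```java
import java.net.URI;

public class NamespaceMapper {
    /** Parse an OpenLineage namespace into a DataHub platform identifier. */
    static String toPlatform(String namespace) {
        // "s3://bucket" -> scheme "s3"; a bare namespace such as "hive"
        // has no scheme and already names the platform.
        URI uri = URI.create(namespace);
        return uri.getScheme() != null ? uri.getScheme() : namespace;
    }
}
```

The real converter layers additional rules on top of this (aliasing, platform instances), but the core move is the same: strip the URI structure down to a platform id.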
Usage
The OpenLineage Conversion principle is applied at the core of the Spark lineage pipeline, after the OpenLineage Spark library has generated a RunEvent from Spark execution plan analysis and before the converted metadata is emitted to DataHub. The conversion handles:
- Job mapping: An OpenLineage `RunEvent` is converted to a `DatahubJob` containing a `DataFlowUrn` (the pipeline), a `DataJobUrn` (the task), `DataFlowInfo`, and `DataJobInfo` with custom properties.
- Dataset mapping: OpenLineage input/output datasets are converted to DataHub `DatasetUrn` values by parsing the namespace (which determines the platform) and the name (which determines the dataset path). Symlink resolution maps file-system paths to catalog table names when available.
- Column-level lineage: OpenLineage `ColumnLineageDatasetFacet` entries are translated to DataHub `FineGrainedLineage` arrays with upstream/downstream schema field URNs and transformation operations.
- Ownership and tags: OpenLineage job ownership facets are mapped to DataHub `Ownership` aspects, and Airflow DAG tags are mapped to DataHub `GlobalTags`.
- Custom properties: Processing engine details, Spark version, job IDs, and logical plans are extracted from OpenLineage run facets and stored as DataHub custom properties.
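The job-mapping step above hinges on building the two job-related URNs. A minimal sketch, assuming simplified string builders in place of DataHub's actual URN classes (the URN shapes follow DataHub's documented formats; the helper names are hypothetical):

```java
public class JobUrns {
    /** DataFlow URN: identifies the pipeline within an orchestrator/cluster. */
    static String dataFlowUrn(String orchestrator, String flowId, String cluster) {
        return "urn:li:dataFlow:(" + orchestrator + "," + flowId + "," + cluster + ")";
    }

    /** DataJob URN: identifies a task nested under its parent DataFlow. */
    static String dataJobUrn(String flowUrn, String jobId) {
        return "urn:li:dataJob:(" + flowUrn + "," + jobId + ")";
    }
}
```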
Theoretical Basis
The OpenLineage Conversion principle draws on several foundational software engineering concepts:
Adapter pattern: The conversion layer acts as a structural adapter between two incompatible interfaces -- the OpenLineage event model and the DataHub entity model. The adapter translates method calls and data structures from one interface to the other without modifying either side. This enables the Spark integration to benefit from the OpenLineage ecosystem (which provides plan analysis for multiple Spark versions) while producing output that conforms to DataHub's specific metadata schema.
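The adapter structure can be made concrete with simplified stand-ins for the two models. Neither record below is a real openlineage-java or DataHub class; they only illustrate that the adapter consumes one shape and produces the other without modifying either side:

```java
// Stand-in for an OpenLineage dataset: namespace + flat name.
record OlDataset(String namespace, String name) {}

// Stand-in for DataHub's dataset identity: platform + name + environment.
record DhDataset(String platform, String name, String env) {}

class OpenLineageToDataHubAdapter {
    /** Translate the OpenLineage shape into the DataHub shape. */
    DhDataset convert(OlDataset in, String env) {
        int i = in.namespace().indexOf("://");
        String platform = i < 0 ? in.namespace() : in.namespace().substring(0, i);
        return new DhDataset(platform, in.name(), env);
    }
}
```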
Schema mapping between metadata formats: The conversion must handle fundamental differences in how the two systems model metadata:
- Namespace to Platform: OpenLineage uses URI-based namespaces (e.g., `s3://bucket`, `hive`) that must be parsed into DataHub platform identifiers (e.g., `s3`, `hive`).
- Flat names to structured URNs: OpenLineage dataset names are flat strings that must be transformed into DataHub's URN format: `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`.
- Facets to Aspects: OpenLineage attaches metadata through extensible facets, while DataHub uses typed aspects. The conversion maps specific facets (schema, column lineage, symlinks) to their corresponding DataHub aspects.
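The facets-to-aspects mapping is clearest for column lineage: each output field's list of contributing input fields becomes an edge of schema-field URNs. A hedged sketch with plain collections standing in for the facet and aspect classes (`toEdges` and its inputs are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnLineageMapper {
    /** Schema-field URN nested under its dataset URN. */
    static String schemaFieldUrn(String datasetUrn, String fieldPath) {
        return "urn:li:schemaField:(" + datasetUrn + "," + fieldPath + ")";
    }

    /**
     * One fine-grained edge per output column: downstream field URN mapped
     * to the list of upstream field URNs it was derived from.
     */
    static Map<String, List<String>> toEdges(
            String inUrn, String outUrn, Map<String, List<String>> facet) {
        Map<String, List<String>> edges = new LinkedHashMap<>();
        facet.forEach((outField, inFields) -> edges.put(
                schemaFieldUrn(outUrn, outField),
                inFields.stream().map(f -> schemaFieldUrn(inUrn, f)).toList()));
        return edges;
    }
}
```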
URN construction strategy: DataHub entities are identified by URNs following the pattern `urn:li:<entityType>:<key>`. The conversion layer must construct valid URNs that are consistent with URNs produced by other DataHub ingestion sources. This is why configuration options like `platformInstance`, `env`, `hivePlatformAlias`, and `lowerCaseUrns` exist -- they ensure that a dataset ingested by the Spark agent has the same URN as the same dataset ingested by a platform-specific source (e.g., the Hive or S3 ingestion plugin).
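The effect of these options can be shown with a hypothetical URN builder (the real converter uses DataHub's URN classes; this sketch only demonstrates why the knobs matter for cross-source consistency):

```java
public class DatasetUrnBuilder {
    /**
     * Build a dataset URN. With lowerCaseUrns enabled, "Sales.Orders" and
     * "sales.orders" resolve to the same URN, matching sources that
     * lowercase names; platformInstance disambiguates multiple clusters.
     */
    static String build(String platform, String name, String platformInstance,
                        String env, boolean lowerCaseUrns) {
        String n = lowerCaseUrns ? name.toLowerCase() : name;
        String qualified = platformInstance == null ? n : platformInstance + "." + n;
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + ","
                + qualified + "," + env + ")";
    }
}
```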
Symlink resolution: When datasets have both a physical storage path (e.g., an S3 location) and a logical catalog identity (e.g., a Hive table name), the conversion layer resolves symlinks to prefer the catalog identity. This produces URNs that match what users expect to see and what other ingestion sources produce, enabling lineage graphs to connect properly across different metadata sources.
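The preference logic itself is simple and can be sketched as follows, assuming the catalog identity has already been extracted from the symlinks facet (the method and parameter names are illustrative):

```java
import java.util.Optional;

public class SymlinkResolver {
    /**
     * Prefer the logical catalog identity (e.g., a Hive table name) over
     * the physical storage path when a symlink is present.
     */
    static String resolve(String physicalName, Optional<String> catalogTable) {
        return catalogTable.orElse(physicalName);
    }
}
```

With a symlink present, the emitted URN names the catalog table, so the Spark-captured lineage edge lands on the same node that the Hive ingestion source produces.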