Implementation:Datahub project Datahub SparkPathUtils
| Knowledge Sources | |
|---|---|
| Domains | Spark_Lineage, OpenLineage |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
PathUtils is a utility class in the io.openlineage.spark.agent.util package that converts filesystem paths, URIs, and Spark catalog table references into OpenLineage DatasetIdentifier objects. It serves as the central path resolution layer for the Spark lineage integration, handling Hadoop paths, Hive metastore tables, AWS Glue catalog tables, and warehouse-relative table locations.
The class supports symlink-based dataset identification, where a physical storage location is linked to a logical table identifier in a metastore (Hive or Glue), enabling DataHub to track both the physical and logical representations of a dataset.
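The physical/logical pairing can be pictured with a minimal sketch. The types below are simplified stand-ins for illustration, not the actual OpenLineage classes; the namespace and metastore values are placeholders:

```java
import java.util.List;

public class SymlinkSketch {
    // Simplified stand-ins (not the real OpenLineage types): a dataset is
    // identified by its physical storage location, and carries a symlink
    // pointing at its logical metastore table.
    record Symlink(String namespace, String name, String type) {}
    record DatasetId(String namespace, String name, List<Symlink> symlinks) {}

    public static void main(String[] args) {
        DatasetId id = new DatasetId(
            "s3://my-bucket",              // physical storage namespace (placeholder)
            "/warehouse/mydb.db/mytable",  // physical path
            List.of(new Symlink("hive://metastore:9083", "mydb.mytable", "TABLE")));
        System.out.println(id);
    }
}
```

With both representations on one identifier, a consumer such as DataHub can join lineage recorded against the file path with lineage recorded against the table name.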
Source file: metadata-integration/java/acryl-spark-lineage/src/main/java/io/openlineage/spark/agent/util/PathUtils.java (191 lines)
Code Reference
Class Declaration
```java
@Slf4j
@SuppressWarnings("PMD.AvoidLiteralsInIfCondition")
public class PathUtils {
  private static final String DEFAULT_DB = "default";
  public static final String GLUE_TABLE_PREFIX = "table/";
```
Key Methods
fromPath
```java
public static DatasetIdentifier fromPath(Path path)
```
Converts a Hadoop Path to a DatasetIdentifier by delegating to fromURI.
fromURI
```java
public static DatasetIdentifier fromURI(URI location)
```
Converts a URI to a DatasetIdentifier using the OpenLineage FilesystemDatasetUtils.fromLocation utility. This is the fundamental building block used throughout the lineage integration.
fromCatalogTable (three overloads)
```java
public static DatasetIdentifier fromCatalogTable(
    CatalogTable catalogTable, SparkSession sparkSession)
public static DatasetIdentifier fromCatalogTable(
    CatalogTable catalogTable, SparkSession sparkSession, Path location)
public static DatasetIdentifier fromCatalogTable(
    CatalogTable catalogTable, SparkSession sparkSession, URI location)
```
The primary overload (accepting CatalogTable and SparkSession) resolves the storage location URI from the catalog table metadata, falling back to the default table path. The URI-based overload then performs:
- URL normalization via `FilesystemDatasetUtils`
- Symlink resolution based on the catalog backend:
  - AWS Glue: creates a symlink with ARN format `arn:aws:glue:{region}:{account_id}:table/{database}/{table}`
  - Hive Metastore: creates a symlink using the Hive metastore URI and table name
  - Warehouse-relative: creates a symlink using the warehouse location and table name
- Attaches the symlink to the `DatasetIdentifier` as `SymlinkType.TABLE`
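To make the Glue case concrete, the ARN symlink name follows the pattern shown above. The sketch below is illustrative only: `glueArn` is a hypothetical helper (not a PathUtils method), and the region/account values are placeholders:

```java
public class GlueArnSketch {
    // Hypothetical helper illustrating the Glue ARN symlink format
    // arn:aws:glue:{region}:{account_id}:table/{database}/{table}
    static String glueArn(String region, String accountId, String database, String table) {
        return String.format("arn:aws:glue:%s:%s:table/%s/%s", region, accountId, database, table);
    }

    public static void main(String[] args) {
        // Placeholder region and account id
        System.out.println(glueArn("us-east-1", "123456789012", "mydb", "mytable"));
        // arn:aws:glue:us-east-1:123456789012:table/mydb/mytable
    }
}
```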
getDefaultLocationUri
```java
public static URI getDefaultLocationUri(SparkSession sparkSession, TableIdentifier identifier)
```
Returns the default storage path for a table by querying the Spark session catalog.
reconstructDefaultLocation
```java
public static Path reconstructDefaultLocation(String warehouse, String[] namespace, String name)
```
Reconstructs the default table path from warehouse root, namespace, and table name. Follows the convention /warehouse/mydb.db/mytable for non-default databases and /warehouse/mytable for the default database.
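The convention can be sketched in a few lines. This is a hypothetical helper illustrating the rule stated above, not the actual PathUtils implementation:

```java
public class WarehouseLocationSketch {
    private static final String DEFAULT_DB = "default";

    // Sketch of the warehouse path convention: tables in a non-default
    // database live under <warehouse>/<db>.db/<table>; tables in the
    // default database live directly under <warehouse>/<table>.
    static String defaultLocation(String warehouse, String db, String table) {
        return DEFAULT_DB.equals(db)
            ? warehouse + "/" + table
            : warehouse + "/" + db + ".db/" + table;
    }

    public static void main(String[] args) {
        System.out.println(defaultLocation("/warehouse", "mydb", "mytable"));    // /warehouse/mydb.db/mytable
        System.out.println(defaultLocation("/warehouse", "default", "mytable")); // /warehouse/mytable
    }
}
```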
prepareHiveUri
```java
public static URI prepareHiveUri(URI uri)
```
Normalizes a Hive metastore URI to the format hive://authority, stripping path, query, and fragment components.
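That normalization can be sketched with `java.net.URI` alone. `toHiveUri` is a hypothetical helper illustrating the described behavior, not the actual method:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class HiveUriSketch {
    // Keep only the authority (host:port), force the "hive" scheme, and
    // discard path, query, and fragment components.
    static URI toHiveUri(URI uri) {
        try {
            return new URI("hive", uri.getAuthority(), null, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHiveUri(URI.create("thrift://metastore-host:9083/some/path?x=1")));
        // hive://metastore-host:9083
    }
}
```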
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | `Path` / `URI` | A Hadoop filesystem path or URI representing a dataset location. |
| Input | `CatalogTable` + `SparkSession` | A Spark catalog table with its associated session for metastore resolution. |
| Output | `DatasetIdentifier` | An OpenLineage dataset identifier containing namespace, name, and optional symlinks for metastore-registered tables. |
Usage Examples
Resolving a simple filesystem path:
```java
URI location = URI.create("hdfs://namenode:8020/data/warehouse/my_table");
DatasetIdentifier id = PathUtils.fromURI(location);
// id.getNamespace() -> "hdfs://namenode:8020"
// id.getName()      -> "/data/warehouse/my_table"
```
Resolving a Hive catalog table:
```java
CatalogTable table = sparkSession.sessionState().catalog().getTableMetadata(tableIdent);
DatasetIdentifier id = PathUtils.fromCatalogTable(table, sparkSession);
// Returns a DatasetIdentifier with the physical location plus a Hive symlink
```
Related Pages
- Datahub_project_Datahub_RddPathUtils - RDD-level path extraction that feeds into PathUtils for identifier creation
- Datahub_project_Datahub_RemovePathPatternUtils - Post-processing of dataset names derived from PathUtils
- Datahub_project_Datahub_WriteToDataSourceV2Visitor - Uses `PathUtils.fromURI` for streaming output dataset identification
- Datahub_project_Datahub_StreamingDataSourceV2RelationVisitor - Uses path utilities indirectly via stream strategies
- Datahub_project_Datahub_SparkStreamingEventToDatahub - Uses `HdfsPathDataset` for path-based URN generation in streaming