Implementation:Datahub project Datahub SparkPathUtils
| Knowledge Sources | |
|---|---|
| Domains | Spark_Lineage, OpenLineage |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
PathUtils is a utility class in the io.openlineage.spark.agent.util package that converts filesystem paths, URIs, and Spark catalog table references into OpenLineage DatasetIdentifier objects. It serves as the central path resolution layer for the Spark lineage integration, handling Hadoop paths, Hive metastore tables, AWS Glue catalog tables, and warehouse-relative table locations.
The class supports symlink-based dataset identification, where a physical storage location is linked to a logical table identifier in a metastore (Hive or Glue), enabling DataHub to track both the physical and logical representations of a dataset.
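The physical/logical pairing can be pictured with a minimal sketch. The types below are simplified stand-ins for illustration, not the actual OpenLineage classes; the namespace and metastore values are placeholders:

```java
import java.util.List;

public class SymlinkSketch {
    // Simplified stand-ins (not the real OpenLineage types): a dataset is
    // identified by its physical storage location, and carries a symlink
    // pointing at its logical metastore table.
    record Symlink(String namespace, String name, String type) {}
    record DatasetId(String namespace, String name, List<Symlink> symlinks) {}

    public static void main(String[] args) {
        DatasetId id = new DatasetId(
            "s3://my-bucket",              // physical storage namespace (placeholder)
            "/warehouse/mydb.db/mytable",  // physical path
            List.of(new Symlink("hive://metastore:9083", "mydb.mytable", "TABLE")));
        System.out.println(id);
    }
}
```

With both representations on one identifier, a consumer such as DataHub can join lineage recorded against the file path with lineage recorded against the table name.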
Source file: metadata-integration/java/acryl-spark-lineage/src/main/java/io/openlineage/spark/agent/util/PathUtils.java (191 lines)
Code Reference
Class Declaration
```java
@Slf4j
@SuppressWarnings("PMD.AvoidLiteralsInIfCondition")
public class PathUtils {
  private static final String DEFAULT_DB = "default";
  public static final String GLUE_TABLE_PREFIX = "table/";
```
Key Methods
fromPath
```java
public static DatasetIdentifier fromPath(Path path)
```
Converts a Hadoop Path to a DatasetIdentifier by delegating to fromURI.
fromURI
```java
public static DatasetIdentifier fromURI(URI location)
```
Converts a URI to a DatasetIdentifier using the OpenLineage FilesystemDatasetUtils.fromLocation utility. This is the fundamental building block used throughout the lineage integration.
fromCatalogTable (three overloads)
```java
public static DatasetIdentifier fromCatalogTable(
    CatalogTable catalogTable, SparkSession sparkSession)
public static DatasetIdentifier fromCatalogTable(
    CatalogTable catalogTable, SparkSession sparkSession, Path location)
public static DatasetIdentifier fromCatalogTable(
    CatalogTable catalogTable, SparkSession sparkSession, URI location)
```
The primary overload (accepting CatalogTable and SparkSession) resolves the storage location URI from the catalog table metadata, falling back to the default table path. The URI-based overload then performs:
- URL normalization via `FilesystemDatasetUtils`
- Symlink resolution based on the catalog backend:
  - AWS Glue: creates a symlink with ARN format `arn:aws:glue:{region}:{account_id}:table/{database}/{table}`
  - Hive Metastore: creates a symlink using the Hive metastore URI and table name
  - Warehouse-relative: creates a symlink using the warehouse location and table name
- Attaches the symlink to the `DatasetIdentifier` as `SymlinkType.TABLE`
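To make the Glue case concrete, the ARN symlink name follows the pattern shown above. The sketch below is illustrative only: `glueArn` is a hypothetical helper (not a PathUtils method), and the region/account values are placeholders:

```java
public class GlueArnSketch {
    // Hypothetical helper illustrating the Glue ARN symlink format
    // arn:aws:glue:{region}:{account_id}:table/{database}/{table}
    static String glueArn(String region, String accountId, String database, String table) {
        return String.format("arn:aws:glue:%s:%s:table/%s/%s", region, accountId, database, table);
    }

    public static void main(String[] args) {
        // Placeholder region and account id
        System.out.println(glueArn("us-east-1", "123456789012", "mydb", "mytable"));
        // arn:aws:glue:us-east-1:123456789012:table/mydb/mytable
    }
}
```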
getDefaultLocationUri
```java
public static URI getDefaultLocationUri(SparkSession sparkSession, TableIdentifier identifier)
```
Returns the default storage path for a table by querying the Spark session catalog.
reconstructDefaultLocation
```java
public static Path reconstructDefaultLocation(String warehouse, String[] namespace, String name)
```
Reconstructs the default table path from warehouse root, namespace, and table name. Follows the convention /warehouse/mydb.db/mytable for non-default databases and /warehouse/mytable for the default database.
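The convention can be sketched in a few lines. This is a hypothetical helper illustrating the rule stated above, not the actual PathUtils implementation:

```java
public class WarehouseLocationSketch {
    private static final String DEFAULT_DB = "default";

    // Sketch of the warehouse path convention: tables in a non-default
    // database live under <warehouse>/<db>.db/<table>; tables in the
    // default database live directly under <warehouse>/<table>.
    static String defaultLocation(String warehouse, String db, String table) {
        return DEFAULT_DB.equals(db)
            ? warehouse + "/" + table
            : warehouse + "/" + db + ".db/" + table;
    }

    public static void main(String[] args) {
        System.out.println(defaultLocation("/warehouse", "mydb", "mytable"));    // /warehouse/mydb.db/mytable
        System.out.println(defaultLocation("/warehouse", "default", "mytable")); // /warehouse/mytable
    }
}
```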
prepareHiveUri
```java
public static URI prepareHiveUri(URI uri)
```
Normalizes a Hive metastore URI to the format hive://authority, stripping path, query, and fragment components.
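That normalization can be sketched with `java.net.URI` alone. `toHiveUri` is a hypothetical helper illustrating the described behavior, not the actual method:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class HiveUriSketch {
    // Keep only the authority (host:port), force the "hive" scheme, and
    // discard path, query, and fragment components.
    static URI toHiveUri(URI uri) {
        try {
            return new URI("hive", uri.getAuthority(), null, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHiveUri(URI.create("thrift://metastore-host:9083/some/path?x=1")));
        // hive://metastore-host:9083
    }
}
```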
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | `Path` / `URI` | A Hadoop filesystem path or URI representing a dataset location. |
| Input | `CatalogTable` + `SparkSession` | A Spark catalog table with its associated session for metastore resolution. |
| Output | `DatasetIdentifier` | An OpenLineage dataset identifier containing namespace, name, and optional symlinks for metastore-registered tables. |
Usage Examples
Resolving a simple filesystem path:
```java
URI location = URI.create("hdfs://namenode:8020/data/warehouse/my_table");
DatasetIdentifier id = PathUtils.fromURI(location);
// id.getNamespace() -> "hdfs://namenode:8020"
// id.getName()      -> "/data/warehouse/my_table"
```
Resolving a Hive catalog table:
```java
CatalogTable table = sparkSession.sessionState().catalog().getTableMetadata(tableIdent);
DatasetIdentifier id = PathUtils.fromCatalogTable(table, sparkSession);
// Returns a DatasetIdentifier with the physical location plus a Hive symlink
```
Related Pages
- Datahub_project_Datahub_RddPathUtils - RDD-level path extraction that feeds into PathUtils for identifier creation
- Datahub_project_Datahub_RemovePathPatternUtils - Post-processing of dataset names derived from PathUtils
- Datahub_project_Datahub_WriteToDataSourceV2Visitor - Uses `PathUtils.fromURI` for streaming output dataset identification
- Datahub_project_Datahub_StreamingDataSourceV2RelationVisitor - Uses path utilities indirectly via stream strategies
- Datahub_project_Datahub_SparkStreamingEventToDatahub - Uses `HdfsPathDataset` for path-based URN generation in streaming