
Implementation: DataHub project, Spark PathUtils

From Leeroopedia


Knowledge Sources
Domains: Spark_Lineage, OpenLineage
Last Updated: 2026-02-10 00:00 GMT

Overview

PathUtils is a utility class in the io.openlineage.spark.agent.util package that converts filesystem paths, URIs, and Spark catalog table references into OpenLineage DatasetIdentifier objects. It serves as the central path resolution layer for the Spark lineage integration, handling Hadoop paths, Hive metastore tables, AWS Glue catalog tables, and warehouse-relative table locations.

The class supports symlink-based dataset identification, where a physical storage location is linked to a logical table identifier in a metastore (Hive or Glue), enabling DataHub to track both the physical and logical representations of a dataset.

Source file: metadata-integration/java/acryl-spark-lineage/src/main/java/io/openlineage/spark/agent/util/PathUtils.java (191 lines)

Code Reference

Class Declaration

@Slf4j
@SuppressWarnings("PMD.AvoidLiteralsInIfCondition")
public class PathUtils {
  private static final String DEFAULT_DB = "default";
  public static final String GLUE_TABLE_PREFIX = "table/";

Key Methods

fromPath

public static DatasetIdentifier fromPath(Path path)

Converts a Hadoop Path to a DatasetIdentifier by delegating to fromURI.

fromURI

public static DatasetIdentifier fromURI(URI location)

Converts a URI to a DatasetIdentifier using the OpenLineage FilesystemDatasetUtils.fromLocation utility. This is the fundamental building block used throughout the lineage integration.
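The namespace/name split that fromURI produces can be illustrated with plain java.net.URI. This is a minimal sketch of the observable behavior (scheme plus authority become the namespace, the path becomes the name), not the actual FilesystemDatasetUtils implementation; the class and method names here are hypothetical.

```java
import java.net.URI;

// Sketch: how a location URI maps onto the namespace/name pair carried by a
// DatasetIdentifier. Illustrative only, not the real FilesystemDatasetUtils.
public class UriSplitSketch {
    // Namespace is assumed to be scheme://authority.
    static String namespace(URI location) {
        return location.getScheme() + "://" + location.getAuthority();
    }

    // Name is assumed to be the path component.
    static String name(URI location) {
        String path = location.getPath();
        return path.isEmpty() ? "/" : path;
    }

    public static void main(String[] args) {
        URI loc = URI.create("hdfs://namenode:8020/data/warehouse/my_table");
        System.out.println(namespace(loc)); // hdfs://namenode:8020
        System.out.println(name(loc));      // /data/warehouse/my_table
    }
}
```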

fromCatalogTable (three overloads)

public static DatasetIdentifier fromCatalogTable(
    CatalogTable catalogTable, SparkSession sparkSession)

public static DatasetIdentifier fromCatalogTable(
    CatalogTable catalogTable, SparkSession sparkSession, Path location)

public static DatasetIdentifier fromCatalogTable(
    CatalogTable catalogTable, SparkSession sparkSession, URI location)

The primary overload (accepting CatalogTable and SparkSession) resolves the storage location URI from the catalog table metadata, falling back to the default table path. The URI-based overload then performs:

  1. URL normalization via FilesystemDatasetUtils
  2. Symlink resolution based on the catalog backend:
    • AWS Glue: Creates a symlink with ARN format arn:aws:glue:{region}:{account_id}:table/{database}/{table}
    • Hive Metastore: Creates a symlink using the Hive metastore URI and table name
    • Warehouse-relative: Creates a symlink using the warehouse location and table name
  3. Attaches the symlink to the DatasetIdentifier as SymlinkType.TABLE
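The Glue ARN format from step 2 can be sketched as a simple string template. This is a hypothetical helper for illustration; the region, account ID, and table names below are made up, and the real PathUtils derives these values from the Spark session configuration.

```java
// Sketch of the AWS Glue symlink ARN format described above:
// arn:aws:glue:{region}:{account_id}:table/{database}/{table}
public class GlueArnSketch {
    static String glueTableArn(String region, String accountId,
                               String database, String table) {
        return String.format("arn:aws:glue:%s:%s:table/%s/%s",
                region, accountId, database, table);
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(glueTableArn("us-east-1", "123456789012", "sales", "orders"));
        // arn:aws:glue:us-east-1:123456789012:table/sales/orders
    }
}
```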

getDefaultLocationUri

public static URI getDefaultLocationUri(SparkSession sparkSession, TableIdentifier identifier)

Returns the default storage path for a table by querying the Spark session catalog.

reconstructDefaultLocation

public static Path reconstructDefaultLocation(String warehouse, String[] namespace, String name)

Reconstructs the default table path from warehouse root, namespace, and table name. Follows the convention /warehouse/mydb.db/mytable for non-default databases and /warehouse/mytable for the default database.
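The warehouse-path convention can be sketched with plain string handling. This mirrors the convention described above under the assumption that "default" is the default database name (matching the DEFAULT_DB constant); it is not the actual PathUtils source, which operates on Hadoop Path objects and a namespace array.

```java
// Sketch of the default-location convention:
//   /warehouse/mydb.db/mytable  for non-default databases
//   /warehouse/mytable          for the default database
public class WarehousePathSketch {
    private static final String DEFAULT_DB = "default";

    static String defaultLocation(String warehouse, String database, String table) {
        if (DEFAULT_DB.equals(database)) {
            return warehouse + "/" + table;
        }
        return warehouse + "/" + database + ".db/" + table;
    }

    public static void main(String[] args) {
        System.out.println(defaultLocation("/warehouse", "mydb", "mytable"));
        // /warehouse/mydb.db/mytable
        System.out.println(defaultLocation("/warehouse", "default", "mytable"));
        // /warehouse/mytable
    }
}
```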

prepareHiveUri

public static URI prepareHiveUri(URI uri)

Normalizes a Hive metastore URI to the format hive://authority, stripping path, query, and fragment components.
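The hive://authority normalization can be sketched with java.net.URI: keep only the authority component and rebuild the URI with the hive scheme. A minimal sketch of the described behavior, not the actual PathUtils code.

```java
import java.net.URI;

// Sketch: normalize a metastore URI (e.g. thrift://host:9083/path?q#f) to
// hive://authority, dropping path, query, and fragment as described above.
public class HiveUriSketch {
    static URI prepareHiveUri(URI uri) {
        return URI.create("hive://" + uri.getAuthority());
    }

    public static void main(String[] args) {
        URI in = URI.create("thrift://metastore-host:9083/some/path?x=1#frag");
        System.out.println(prepareHiveUri(in)); // hive://metastore-host:9083
    }
}
```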

I/O Contract

Input (Path / URI): A Hadoop filesystem path or URI representing a dataset location.
Input (CatalogTable + SparkSession): A Spark catalog table with its associated session for metastore resolution.
Output (DatasetIdentifier): An OpenLineage dataset identifier containing namespace, name, and optional symlinks for metastore-registered tables.

Usage Examples

Resolving a simple filesystem path:

URI location = URI.create("hdfs://namenode:8020/data/warehouse/my_table");
DatasetIdentifier id = PathUtils.fromURI(location);
// id.getNamespace() -> "hdfs://namenode:8020"
// id.getName() -> "/data/warehouse/my_table"

Resolving a Hive catalog table:

CatalogTable table = sparkSession.sessionState().catalog().getTableMetadata(tableIdent);
DatasetIdentifier id = PathUtils.fromCatalogTable(table, sparkSession);
// Returns DatasetIdentifier with physical location + Hive symlink
