
Implementation:Datahub project Datahub HdfsPathDataset

From Leeroopedia


Knowledge Sources
Domains: OpenLineage_Integration, Dataset_Resolution
Last Updated: 2026-02-10 00:00 GMT

Overview

Description

HdfsPathDataset is a class that extends SparkDataset to represent datasets identified by HDFS-compatible filesystem paths. It resolves raw URI paths into DataHub dataset names and platforms, and supports S3, GCS, ABFS, WASB, DBFS, local files, and HDFS.

Key capabilities:

  • Platform detection -- Determines the DataHub platform from the URI scheme prefix (e.g., s3:// maps to "s3", gs:// maps to "gcs", abfss:// maps to "abs").
  • Path spec matching -- Matches URI paths against configurable path specifications containing {table} markers and wildcards (*). This enables extraction of meaningful dataset names from structured storage paths.
  • Partition stripping -- Optionally strips partition suffixes from paths using a configurable regular expression.
  • Dataset name normalization -- Strips the URI scheme prefix and leading slashes to produce clean dataset names.

The class includes an internal HdfsPlatform enum (separate from the top-level HdfsPlatform enum) that maps URI scheme prefixes to DataHub platform identifiers.
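
The partition-stripping and name-normalization capabilities listed above can be illustrated with a minimal sketch. Everything in it is hypothetical and not taken from HdfsPathDataset: the class name PathNormalizationSketch, both helper methods, and the partition pattern (in DataHub the regular expression comes from configuration).

```java
import java.util.regex.Pattern;

final class PathNormalizationSketch {

    // Hypothetical partition pattern (an assumption for illustration):
    // the first "/key=value" segment and everything after it.
    static final Pattern PARTITION_SUFFIX = Pattern.compile("/[^/=]+=[^/]*(/.*)?$");

    // Removes the partition suffix when a pattern is supplied; a null pattern
    // means stripping is disabled and the path is returned unchanged.
    static String stripPartitions(String path, Pattern partitionRegexp) {
        if (partitionRegexp == null) {
            return path;
        }
        return partitionRegexp.matcher(path).replaceFirst("");
    }

    // Drops the "scheme:" prefix and then any leading slashes.
    static String toDatasetName(String pathUri) {
        String withoutScheme = pathUri.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*:", "");
        return withoutScheme.replaceFirst("^/+", "");
    }
}
```

Under these assumptions, stripping "s3://my-bucket/warehouse/db/table/partition=1" and then normalizing the result gives "my-bucket/warehouse/db/table".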

Usage

Used by the OpenLineage converter to resolve HDFS and cloud storage URIs encountered in Spark lineage events into DataHub dataset entities.

Code Reference

Source Location

metadata-integration/java/openlineage-converter/src/main/java/io/datahubproject/openlineage/dataset/HdfsPathDataset.java

Signature

@ToString
@Slf4j
public class HdfsPathDataset extends SparkDataset {

    public HdfsPathDataset(String platform, String name, String platformInstance,
                           FabricType fabricType, String datasetPath)

    public HdfsPathDataset(String pathUri, String platformInstance, FabricType fabricType)

    public HdfsPathDataset(String pathUri, DatahubOpenlineageConfig datahubConf)

    public HdfsPathDataset(String platform, String name, String datasetPath,
                           DatahubOpenlineageConfig datahubConf)

    public String getDatasetPath()

    public static HdfsPathDataset create(URI path, DatahubOpenlineageConfig datahubConf)
        throws InstantiationException

    static String getMatchedUri(String pathUri, String pathSpec)
}

Import

import io.datahubproject.openlineage.dataset.HdfsPathDataset;

I/O Contract

Inputs

Method | Parameter | Type | Description
create | path | URI | The filesystem URI to resolve into a dataset
create | datahubConf | DatahubOpenlineageConfig | Configuration providing path specs, fabric type, and partition regexp
getMatchedUri | pathUri | String | The URI string to match
getMatchedUri | pathSpec | String | A path spec containing a {table} marker

Outputs

Method | Return Type | Description
create | HdfsPathDataset | A resolved dataset with platform, name, and path
getDatasetPath | String | The original or matched dataset path
getMatchedUri | String (nullable) | The matched URI up to the {table} segment, or null if no match

Platform Resolution (internal HdfsPlatform enum):

URI Prefixes | DataHub Platform
s3, s3a, s3n | s3
gs, gcs | gcs
abfs, abfss | abs
wasb, wasbs | abs
dbfs | dbfs
file | file
(default) | hdfs
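
The mapping in this table can be expressed as a small lookup. The sketch below is an illustration only: PlatformLookupSketch and platformFor are hypothetical names, and matching on the exact, lower-cased URI scheme (with scheme-less paths falling through to the default) is an assumption rather than a description of the internal HdfsPlatform enum.

```java
import java.net.URI;
import java.util.Locale;

final class PlatformLookupSketch {

    // Returns the DataHub platform for a URI, following the table above.
    // Assumption: a URI without a scheme falls through to the "hdfs" default.
    static String platformFor(URI uri) {
        String scheme = uri.getScheme();
        if (scheme == null) {
            return "hdfs";
        }
        switch (scheme.toLowerCase(Locale.ROOT)) {
            case "s3":
            case "s3a":
            case "s3n":
                return "s3";
            case "gs":
            case "gcs":
                return "gcs";
            case "abfs":
            case "abfss":
            case "wasb":
            case "wasbs":
                return "abs";
            case "dbfs":
                return "dbfs";
            case "file":
                return "file";
            default:
                return "hdfs";
        }
    }
}
```

For example, platformFor(URI.create("s3a://bucket/key")) returns "s3", and an unrecognized scheme such as viewfs resolves to "hdfs".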

Usage Examples

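// Create from a URI with configuration.
// create(...) declares InstantiationException, so the caller must handle or propagate it.
URI path = URI.create("s3://my-bucket/warehouse/db/table/partition=1");
DatahubOpenlineageConfig config = ...;
HdfsPathDataset dataset = HdfsPathDataset.create(path, config);
// dataset.getPlatform() -> "s3"
// dataset.getDatasetPath() -> matched path based on path specs

// Path spec matching.
// getMatchedUri is listed in the signature without a public modifier, so this call
// is expected to work only from code in the io.datahubproject.openlineage.dataset package.
String matched = HdfsPathDataset.getMatchedUri(
    "s3://bucket/warehouse/db/my_table/part=1",
    "s3://bucket/warehouse/*/{table}");
// matched -> "s3://bucket/warehouse/db/my_table"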
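
To make the matching behavior in the second example concrete, here is a self-contained sketch of one way such a matcher could work. It is not the HdfsPathDataset implementation: PathSpecMatchSketch and matchedUri are hypothetical names, and the sketch assumes that * and {table} each stand for exactly one whole path segment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PathSpecMatchSketch {

    private static final String TABLE = "{table}";

    // Returns the part of pathUri that matches pathSpec up to and including the
    // {table} segment, or null when the spec has no {table} marker or does not match.
    static String matchedUri(String pathUri, String pathSpec) {
        int tableIdx = pathSpec.indexOf(TABLE);
        if (tableIdx < 0) {
            return null;
        }
        // Ignore anything in the spec that follows the {table} marker.
        String upToTable = pathSpec.substring(0, tableIdx + TABLE.length());

        // Build a regex segment by segment: wildcards become "one path segment",
        // everything else is matched literally.
        String[] segments = upToTable.split("/", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                regex.append('/');
            }
            String segment = segments[i];
            boolean wildcard = segment.equals("*") || segment.equals(TABLE);
            regex.append(wildcard ? "[^/]+" : Pattern.quote(segment));
        }

        // lookingAt anchors the match at the start of the URI but allows trailing
        // segments (such as partitions) to remain unmatched.
        Matcher matcher = Pattern.compile(regex.toString()).matcher(pathUri);
        return matcher.lookingAt() ? matcher.group() : null;
    }
}
```

With the inputs from the example above, this sketch also returns "s3://bucket/warehouse/db/my_table"; a URI under a different bucket returns null.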
