Implementation:Datahub project Datahub HdfsPathDataset
| Knowledge Sources | |
|---|---|
| Domains | OpenLineage_Integration, Dataset_Resolution |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Description
HdfsPathDataset is a class that extends SparkDataset to represent datasets identified by HDFS-compatible filesystem paths. It handles the resolution of raw URI paths into DataHub dataset names and platforms, supporting a variety of cloud and local storage systems including S3, GCS, ABFS, WASB, DBFS, local file, and HDFS.
Key capabilities:
- Platform detection -- Determines the DataHub platform from the URI scheme prefix (e.g., s3:// maps to "s3", gs:// maps to "gcs", abfss:// maps to "abs").
- Path spec matching -- Matches URI paths against configurable path specifications containing {table} markers and wildcards (*). This enables extraction of meaningful dataset names from structured storage paths.
- Partition stripping -- Optionally strips partition suffixes from paths using a configurable regular expression.
- Dataset name normalization -- Strips the URI scheme prefix and leading slashes to produce clean dataset names.
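The last two capabilities can be illustrated with a minimal sketch. This is not the code in HdfsPathDataset: the class name DatasetNameSketch, both helper methods, and the sample partition regular expression are hypothetical, and the sketch assumes the partition expression is meant to match a trailing suffix of the path.

```java
// Hypothetical helpers, not part of DataHub; they only illustrate the idea.
final class DatasetNameSketch {

    // Removes a trailing partition suffix that matches partitionRegexp.
    // Assumption: the configured expression describes the suffix only,
    // so it is anchored at the end of the path here.
    static String stripPartition(String path, String partitionRegexp) {
        if (partitionRegexp == null || partitionRegexp.isEmpty()) {
            return path; // stripping is optional
        }
        return path.replaceFirst("(?:" + partitionRegexp + ")$", "");
    }

    // Drops the "scheme:" prefix and any leading slashes.
    static String toDatasetName(String uri) {
        String withoutScheme = uri.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*:", "");
        return withoutScheme.replaceFirst("^/+", "");
    }
}
```

With the sample expression `(/[^/=]+=[^/]+)+/?`, `stripPartition("s3://my-bucket/warehouse/db/table/partition=1", ...)` yields `s3://my-bucket/warehouse/db/table`, and `toDatasetName` applied to that result yields `my-bucket/warehouse/db/table`.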
The class includes an internal HdfsPlatform enum (separate from the top-level HdfsPlatform enum) that maps URI scheme prefixes to DataHub platform identifiers.
Usage
Used by the OpenLineage converter to resolve HDFS and cloud storage URIs encountered in Spark lineage events into DataHub dataset entities.
Code Reference
Source Location
metadata-integration/java/openlineage-converter/src/main/java/io/datahubproject/openlineage/dataset/HdfsPathDataset.java
Signature
@ToString
@Slf4j
public class HdfsPathDataset extends SparkDataset {
    public HdfsPathDataset(String platform, String name, String platformInstance,
                           FabricType fabricType, String datasetPath)
    public HdfsPathDataset(String pathUri, String platformInstance, FabricType fabricType)
    public HdfsPathDataset(String pathUri, DatahubOpenlineageConfig datahubConf)
    public HdfsPathDataset(String platform, String name, String datasetPath,
                           DatahubOpenlineageConfig datahubConf)
    public String getDatasetPath()
    public static HdfsPathDataset create(URI path, DatahubOpenlineageConfig datahubConf)
        throws InstantiationException
    static String getMatchedUri(String pathUri, String pathSpec)
}
Import
import io.datahubproject.openlineage.dataset.HdfsPathDataset;
I/O Contract
Inputs
| Method | Parameter | Type | Description |
|---|---|---|---|
| create | path | URI | The filesystem URI to resolve into a dataset |
| create | datahubConf | DatahubOpenlineageConfig | Configuration providing path specs, fabric type, and partition regexp |
| getMatchedUri | pathUri | String | The URI string to match |
| getMatchedUri | pathSpec | String | A path spec containing a {table} marker |
Outputs
| Method | Return Type | Description |
|---|---|---|
| create | HdfsPathDataset | A resolved dataset with platform, name, and path |
| getDatasetPath() | String | The original or matched dataset path |
| getMatchedUri | String (nullable) | The matched URI up to the {table} segment, or null if no match |
Platform Resolution (internal HdfsPlatform enum):
| URI Prefixes | DataHub Platform |
|---|---|
| s3, s3a, s3n | s3 |
| gs, gcs | gcs |
| abfs, abfss | abs |
| wasb, wasbs | abs |
| dbfs | dbfs |
| file | file |
| (default) | hdfs |
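The mapping in the table can be sketched as a simple lookup with an hdfs fallback. This is an illustration only, not the internal HdfsPlatform enum: the class PlatformLookupSketch and the method platformFor are hypothetical, and lower-casing the scheme before the lookup is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup mirroring the table above; not DataHub's enum.
final class PlatformLookupSketch {
    private static final Map<String, String> SCHEME_TO_PLATFORM = new HashMap<>();

    static {
        for (String s : new String[] {"s3", "s3a", "s3n"}) SCHEME_TO_PLATFORM.put(s, "s3");
        for (String s : new String[] {"gs", "gcs"}) SCHEME_TO_PLATFORM.put(s, "gcs");
        for (String s : new String[] {"abfs", "abfss", "wasb", "wasbs"}) SCHEME_TO_PLATFORM.put(s, "abs");
        SCHEME_TO_PLATFORM.put("dbfs", "dbfs");
        SCHEME_TO_PLATFORM.put("file", "file");
    }

    // Returns the platform for the URI's scheme, or "hdfs" when the scheme
    // is missing or not listed in the table.
    static String platformFor(String uri) {
        int colon = uri.indexOf(':');
        // Assumption: schemes are compared case-insensitively.
        String scheme = colon > 0 ? uri.substring(0, colon).toLowerCase(Locale.ROOT) : "";
        return SCHEME_TO_PLATFORM.getOrDefault(scheme, "hdfs");
    }
}
```

Under these assumptions, `platformFor("s3a://bucket/key")` returns `"s3"`, and any unlisted scheme such as `hdfs://` or `viewfs://` falls through to `"hdfs"`.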
Usage Examples
// Create from a URI with configuration
URI path = URI.create("s3://my-bucket/warehouse/db/table/partition=1");
DatahubOpenlineageConfig config = ...;
HdfsPathDataset dataset = HdfsPathDataset.create(path, config);
// dataset.getPlatform() -> "s3"
// dataset.getDatasetPath() -> matched path based on path specs
// Path spec matching
String matched = HdfsPathDataset.getMatchedUri(
    "s3://bucket/warehouse/db/my_table/part=1",
    "s3://bucket/warehouse/*/{table}");
// matched -> "s3://bucket/warehouse/db/my_table"
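One way to picture the path spec matching shown above is to translate the spec into a regular expression and return the matched prefix. The following is a sketch under stated assumptions, not the implementation of getMatchedUri: the class PathSpecSketch and the method matchUpToTable are hypothetical, and the sketch assumes that * and {table} each stand for exactly one whole path segment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical matcher illustrating the idea; not DataHub's getMatchedUri.
final class PathSpecSketch {
    private static final String TABLE = "{table}";

    // Returns the part of pathUri that matches pathSpec up to and including
    // the {table} segment, or null when the spec has no marker or no match.
    static String matchUpToTable(String pathUri, String pathSpec) {
        int marker = pathSpec.indexOf(TABLE);
        if (marker < 0) {
            return null;
        }
        // Ignore anything in the spec after the {table} marker.
        String[] segments = pathSpec.substring(0, marker + TABLE.length()).split("/", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                regex.append('/');
            }
            if (segments[i].equals("*") || segments[i].equals(TABLE)) {
                regex.append("[^/]+"); // assumption: one whole segment
            } else {
                regex.append(Pattern.quote(segments[i])); // literal segment
            }
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(pathUri);
        // lookingAt() anchors at the start but allows trailing segments.
        return m.lookingAt() ? m.group() : null;
    }
}
```

For the inputs used above, `matchUpToTable("s3://bucket/warehouse/db/my_table/part=1", "s3://bucket/warehouse/*/{table}")` returns `"s3://bucket/warehouse/db/my_table"`, and a URI on a different bucket returns null.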
Related Pages
- Datahub_project_Datahub_HdfsPlatform -- Top-level platform enum used for prefix detection