Implementation:Datahub project Datahub Dataset Builder
| Field | Value |
|---|---|
| Implementation Name | Dataset Builder |
| Type | API Doc |
| Status | Active |
| Last Updated | 2026-02-10 |
| Repository | Datahub_project_Datahub |
| Source File | metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Dataset.java (Lines 490-688) |
Overview
The Dataset.Builder class provides a fluent API for constructing Dataset entity instances. It enforces required fields (platform, name), generates DataHub URNs automatically, and caches builder-provided properties as aspects for emission during upsert.
Import Statement
import datahub.client.v2.entity.Dataset;
Source Reference
File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Dataset.java (Lines 490-688)
public static class Builder {
    private String platform;
    private String name;
    private String env = "PROD";
    private String platformInstance;
    private String description;
    private String displayName;
    private List<SchemaField> schemaFields;
    private Map<String, String> customProperties;

    @Nonnull
    public Builder platform(@Nonnull String platform) { ... }

    @Nonnull
    public Builder name(@Nonnull String name) { ... }

    @Nonnull
    public Builder env(@Nonnull String env) { ... }

    @Nonnull
    public Builder platformInstance(@Nullable String platformInstance) { ... }

    @Nonnull
    public Builder description(@Nullable String description) { ... }

    @Nonnull
    public Builder displayName(@Nullable String displayName) { ... }

    @Nonnull
    public Builder schemaFields(@Nullable List<SchemaField> fields) { ... }

    @Nonnull
    public Builder customProperties(@Nullable Map<String, String> properties) { ... }

    @Nonnull
    public Builder urn(@Nonnull DatasetUrn urn) { ... }

    @Nonnull
    public Dataset build() { ... }
}
Builder Methods
| Method | Parameter | Required | Default | Description |
|---|---|---|---|---|
| platform(String) | Platform name | Yes | -- | Data platform identifier (e.g., "snowflake", "bigquery", "kafka") |
| name(String) | Dataset name | Yes | -- | Fully qualified dataset name (e.g., "my_database.my_schema.my_table") |
| env(String) | Environment | No | "PROD" | Environment identifier (e.g., "PROD", "DEV", "QA") |
| platformInstance(String) | Instance name | No | null | Platform instance for multi-instance platforms |
| description(String) | Description | No | null | Dataset description text |
| displayName(String) | Display name | No | null | Human-readable display name |
| schemaFields(List) | Schema fields | No | null | List of SchemaField objects defining the dataset schema |
| customProperties(Map) | Properties map | No | null | Key-value pairs of custom metadata properties |
| urn(DatasetUrn) | Existing URN | No | -- | Extracts platform, name, and env from an existing URN |
| build() | -- | -- | -- | Constructs the Dataset entity; throws IllegalArgumentException if required fields are missing |
I/O Contract
Input:
- Required: platform(String) and name(String)
- Optional: env, platformInstance, description, displayName, schemaFields, customProperties
Output: A Dataset entity instance with:
- Auto-generated URN: urn:li:dataset:(urn:li:dataPlatform:PLATFORM,NAME,ENV)
- Cached DatasetProperties aspect (if description, displayName, or customProperties were set)
- Cached SchemaMetadata aspect (if schemaFields were provided)
- Cached DataPlatformInstance aspect (if platformInstance was provided)
Exceptions:
- IllegalArgumentException -- if platform or name is null
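The contract above can be illustrated with a minimal, self-contained sketch. It does not use the DataHub client; DatasetUrnSketch and its format methods are hypothetical names that only mirror the documented URN layout, the "PROD" default for env, and the IllegalArgumentException raised when platform or name is null.

```java
// Hypothetical helper, not part of the DataHub client. It mirrors only the
// documented URN layout and the documented null check on required fields.
final class DatasetUrnSketch {
    private DatasetUrnSketch() {}

    // Uses "PROD", the documented default environment.
    static String format(String platform, String name) {
        return format(platform, name, "PROD");
    }

    static String format(String platform, String name, String env) {
        if (platform == null || name == null) {
            throw new IllegalArgumentException("platform and name are required");
        }
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")";
    }
}
```

Calling DatasetUrnSketch.format("snowflake", "my_database.my_schema.my_table") yields a string in the documented shape, with PROD as the environment.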
Entity Class Hierarchy
The Dataset class extends Entity and implements multiple trait interfaces:
public class Dataset extends Entity
        implements HasTags<Dataset>,
                   HasGlossaryTerms<Dataset>,
                   HasOwners<Dataset>,
                   HasDomains<Dataset>,
                   HasSubTypes<Dataset>,
                   HasStructuredProperties<Dataset> {
    // ...
}
The entity type is "dataset" and the default aspects fetched from the server are:
- Ownership
- GlobalTags
- GlossaryTerms
- Domains
- Status
- InstitutionalMemory
- DatasetProperties
- EditableDatasetProperties
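Each trait interface in the class declaration takes the entity type as a type parameter. One plausible reason, sketched below, is that a self-typed interface lets default methods return the concrete entity type so that calls can be chained. This is an illustration of the general pattern only: TaggableSketch, TableSketch, addTag, and the String tag representation are hypothetical and are not taken from the DataHub source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical self-typed trait. T is the implementing class, so addTag can
// return that concrete type rather than the interface type.
interface TaggableSketch<T extends TaggableSketch<T>> {
    List<String> tags();

    @SuppressWarnings("unchecked")
    default T addTag(String tag) {
        tags().add(tag);
        return (T) this; // safe as long as T is the implementing class
    }
}

// Hypothetical entity that mixes in the trait and adds its own fluent method.
final class TableSketch implements TaggableSketch<TableSketch> {
    private final List<String> tags = new ArrayList<>();
    private String description;

    @Override
    public List<String> tags() {
        return tags;
    }

    TableSketch describe(String description) {
        this.description = description;
        return this;
    }

    String description() {
        return description;
    }
}
```

Because addTag returns TableSketch rather than the interface type, a chain such as new TableSketch().addTag("pii").describe("Users table") compiles without a cast.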
Build Process
The build() method (Lines 620-687) follows this sequence:
- Validate that platform and name are not null
- Create a DataPlatformUrn from the platform string
- Construct the dataset key: (urn:li:dataPlatform:PLATFORM,NAME,ENV)
- Parse the full URN: urn:li:dataset:(urn:li:dataPlatform:PLATFORM,NAME,ENV)
- Create a new Dataset instance with the generated URN
- If description, displayName, or customProperties are set, create and cache a DatasetProperties aspect
- If schemaFields are provided, create and cache a SchemaMetadata aspect
- If platformInstance is provided, create and cache a DataPlatformInstance aspect
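The sequence above can be sketched without any DataHub classes. In the following simplified stand-in, BuildSequenceSketch and Result are hypothetical names, aspects are represented only by their names, schema fields are reduced to a list of strings, and "set" is assumed to mean "non-null". It is not the actual build() implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the documented build() sequence.
final class BuildSequenceSketch {
    static final class Result {
        final String urn;
        final List<String> cachedAspects;

        Result(String urn, List<String> cachedAspects) {
            this.urn = urn;
            this.cachedAspects = Collections.unmodifiableList(cachedAspects);
        }
    }

    String platform;
    String name;
    String env = "PROD"; // documented default
    String platformInstance;
    String description;
    String displayName;
    List<String> schemaFields;
    Map<String, String> customProperties;

    Result build() {
        // Validate required fields first.
        if (platform == null || name == null) {
            throw new IllegalArgumentException("platform and name are required");
        }
        // Platform URN, then dataset key, then the full dataset URN.
        String platformUrn = "urn:li:dataPlatform:" + platform;
        String key = "(" + platformUrn + "," + name + "," + env + ")";
        String urn = "urn:li:dataset:" + key;

        // Record which aspects would be cached; assumption: "set" means non-null.
        List<String> aspects = new ArrayList<>();
        if (description != null || displayName != null || customProperties != null) {
            aspects.add("DatasetProperties");
        }
        if (schemaFields != null) {
            aspects.add("SchemaMetadata");
        }
        if (platformInstance != null) {
            aspects.add("DataPlatformInstance");
        }
        return new Result(urn, aspects);
    }
}
```

With only platform and name populated, the sketch produces a URN and an empty aspect list, which matches the documented rule that aspects are cached only for properties the caller supplied.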
Usage Examples
Minimal Dataset
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_database.my_schema.my_table")
    .build();
// URN: urn:li:dataset:(urn:li:dataPlatform:snowflake,my_database.my_schema.my_table,PROD)
Full Dataset
// Requires: import java.util.Map;
Dataset dataset = Dataset.builder()
    .platform("bigquery")
    .name("project.dataset.table")
    .env("DEV")
    .platformInstance("us-east1")
    .description("Customer transactions table")
    .displayName("Customer Transactions")
    .customProperties(Map.of("team", "data-eng", "pii", "true"))
    .build();
From Existing URN
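// DatasetUrn.createFromString declares a checked URISyntaxException,
// so call it from a method that declares or handles that exception.
DatasetUrn existingUrn = DatasetUrn.createFromString(
    "urn:li:dataset:(urn:li:dataPlatform:mysql,db.users,PROD)");
Dataset dataset = Dataset.builder()
    .urn(existingUrn)
    .description("Users table")
    .build();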
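The urn(DatasetUrn) method is documented as extracting platform, name, and env from an existing URN. The following sketch shows what that extraction amounts to at the string level. UrnPartsSketch is a hypothetical class that parses the textual form directly; the real builder works with a DatasetUrn object, and its internals are not shown in this document.

```java
// Hypothetical string-level parser for the documented dataset URN layout:
// urn:li:dataset:(urn:li:dataPlatform:PLATFORM,NAME,ENV)
final class UrnPartsSketch {
    private static final String DATASET_PREFIX = "urn:li:dataset:(";
    private static final String PLATFORM_PREFIX = "urn:li:dataPlatform:";

    final String platform;
    final String name;
    final String env;

    private UrnPartsSketch(String platform, String name, String env) {
        this.platform = platform;
        this.name = name;
        this.env = env;
    }

    static UrnPartsSketch parse(String urn) {
        if (urn == null || !urn.startsWith(DATASET_PREFIX) || !urn.endsWith(")")) {
            throw new IllegalArgumentException("not a dataset URN: " + urn);
        }
        String body = urn.substring(DATASET_PREFIX.length(), urn.length() - 1);
        // The first comma ends the platform URN and the last comma starts env,
        // so everything in between is treated as the dataset name.
        int first = body.indexOf(',');
        int last = body.lastIndexOf(',');
        if (first < 0 || first == last) {
            throw new IllegalArgumentException("expected three comma-separated parts: " + urn);
        }
        String platformUrn = body.substring(0, first);
        if (!platformUrn.startsWith(PLATFORM_PREFIX)) {
            throw new IllegalArgumentException("missing data platform URN: " + urn);
        }
        return new UrnPartsSketch(
            platformUrn.substring(PLATFORM_PREFIX.length()),
            body.substring(first + 1, last),
            body.substring(last + 1));
    }
}
```

Applied to the URN in the example above, the sketch separates it into the platform mysql, the name db.users, and the environment PROD.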
Related
- Implements: Datahub_project_Datahub_Entity_Construction
- Related Implementation: Datahub_project_Datahub_Entity_Metadata_Mutations
- Related Implementation: Datahub_project_Datahub_EntityClient_Upsert
- Environment: Environment:Datahub_project_Datahub_Java_17_Backend_Environment