Implementation:Datahub project Datahub Dataset Builder


| Field | Value |
| --- | --- |
| Implementation Name | Dataset Builder |
| Type | API Doc |
| Status | Active |
| Last Updated | 2026-02-10 |
| Repository | Datahub_project_Datahub |
| Source File | metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Dataset.java (Lines 490-688) |

Overview

The Dataset.Builder class provides a fluent API for constructing Dataset entity instances. It enforces required fields (platform, name), generates DataHub URNs automatically, and caches builder-provided properties as aspects for emission during upsert.

Import Statement

import datahub.client.v2.entity.Dataset;

Source Reference

File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Dataset.java (Lines 490-688)

public static class Builder {
    private String platform;
    private String name;
    private String env = "PROD";
    private String platformInstance;
    private String description;
    private String displayName;
    private List<SchemaField> schemaFields;
    private Map<String, String> customProperties;

    @Nonnull
    public Builder platform(@Nonnull String platform) { ... }

    @Nonnull
    public Builder name(@Nonnull String name) { ... }

    @Nonnull
    public Builder env(@Nonnull String env) { ... }

    @Nonnull
    public Builder platformInstance(@Nullable String platformInstance) { ... }

    @Nonnull
    public Builder description(@Nullable String description) { ... }

    @Nonnull
    public Builder displayName(@Nullable String displayName) { ... }

    @Nonnull
    public Builder schemaFields(@Nullable List<SchemaField> fields) { ... }

    @Nonnull
    public Builder customProperties(@Nullable Map<String, String> properties) { ... }

    @Nonnull
    public Builder urn(@Nonnull DatasetUrn urn) { ... }

    @Nonnull
    public Dataset build() { ... }
}

Builder Methods

| Method | Parameter | Required | Default | Description |
| --- | --- | --- | --- | --- |
| platform(String) | Platform name | Yes | -- | Data platform identifier (e.g., "snowflake", "bigquery", "kafka") |
| name(String) | Dataset name | Yes | -- | Fully qualified dataset name (e.g., "my_database.my_schema.my_table") |
| env(String) | Environment | No | "PROD" | Environment identifier (e.g., "PROD", "DEV", "QA") |
| platformInstance(String) | Instance name | No | null | Platform instance for multi-instance platforms |
| description(String) | Description | No | null | Dataset description text |
| displayName(String) | Display name | No | null | Human-readable display name |
| schemaFields(List) | Schema fields | No | null | List of SchemaField objects defining the dataset schema |
| customProperties(Map) | Properties map | No | null | Key-value pairs of custom metadata properties |
| urn(DatasetUrn) | Existing URN | No | -- | Extracts platform, name, and env from an existing URN |
| build() | -- | -- | -- | Constructs the Dataset entity; throws IllegalArgumentException if required fields missing |

I/O Contract

Input:

  • Required: platform (String) and name (String)
  • Optional: env, platformInstance, description, displayName, schemaFields, customProperties

Output: A Dataset entity instance with:

  • Auto-generated URN: urn:li:dataset:(urn:li:dataPlatform:PLATFORM,NAME,ENV)
  • Cached DatasetProperties aspect (if description, displayName, or customProperties were set)
  • Cached SchemaMetadata aspect (if schemaFields were provided)
  • Cached DataPlatformInstance aspect (if platformInstance was provided)

Exceptions:

  • IllegalArgumentException -- if platform or name is null
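
The contract above can be illustrated with a small stand-alone sketch. It does not call the DataHub SDK: UrnSketch and datasetUrn are hypothetical names introduced here, and the code only mirrors the documented behavior (env defaults to "PROD", and a null platform or name raises IllegalArgumentException). The exception message is also an assumption.

```java
// Hypothetical stand-in for the documented contract; not the DataHub SDK.
final class UrnSketch {
    private UrnSketch() {}

    /** Uses the documented default environment, "PROD". */
    static String datasetUrn(String platform, String name) {
        return datasetUrn(platform, name, "PROD");
    }

    /** Formats urn:li:dataset:(urn:li:dataPlatform:PLATFORM,NAME,ENV). */
    static String datasetUrn(String platform, String name, String env) {
        // Mirrors the documented validation: both required fields must be present.
        if (platform == null || name == null) {
            throw new IllegalArgumentException("platform and name are required");
        }
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")";
    }
}
```

Calling UrnSketch.datasetUrn("snowflake", "my_database.my_schema.my_table") yields the same URN string shown in the Minimal Dataset example further down the page.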

Entity Class Hierarchy

The Dataset class extends Entity and implements multiple trait interfaces:

public class Dataset extends Entity
    implements HasTags<Dataset>,
        HasGlossaryTerms<Dataset>,
        HasOwners<Dataset>,
        HasDomains<Dataset>,
        HasSubTypes<Dataset>,
        HasStructuredProperties<Dataset> {
    // ...
}

The entity type is "dataset" and the default aspects fetched from the server are:

  • Ownership
  • GlobalTags
  • GlossaryTerms
  • Domains
  • Status
  • InstitutionalMemory
  • DatasetProperties
  • EditableDatasetProperties

Build Process

The build() method (Lines 620-687) follows this sequence:

  1. Validate that platform and name are not null
  2. Create a DataPlatformUrn from the platform string
  3. Construct the dataset key: (urn:li:dataPlatform:PLATFORM,NAME,ENV)
  4. Parse the full URN: urn:li:dataset:(urn:li:dataPlatform:PLATFORM,NAME,ENV)
  5. Create a new Dataset instance with the generated URN
  6. If description, displayName, or customProperties are set, create and cache a DatasetProperties aspect
  7. If schemaFields are provided, create and cache a SchemaMetadata aspect
  8. If platformInstance is provided, create and cache a DataPlatformInstance aspect
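
Steps 6 to 8 amount to a set of independent null checks. The sketch below models only that decision logic, under stated assumptions: AspectPlanSketch and cachedAspects are hypothetical names, aspects are represented by their names rather than by real aspect objects, schema fields are reduced to a list of strings, and "set" is taken to mean "non-null" (how the real builder treats empty collections is not stated in the source).

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of build() steps 6-8; not the DataHub SDK.
final class AspectPlanSketch {
    private AspectPlanSketch() {}

    /** Returns the names of the aspects that would be cached for the given optional fields. */
    static Set<String> cachedAspects(
            String description,
            String displayName,
            Map<String, String> customProperties,
            List<String> schemaFieldNames,
            String platformInstance) {
        Set<String> aspects = new LinkedHashSet<>();
        // Step 6: any one of the three property fields triggers DatasetProperties.
        if (description != null || displayName != null || customProperties != null) {
            aspects.add("DatasetProperties");
        }
        // Step 7: schema fields trigger SchemaMetadata.
        if (schemaFieldNames != null) {
            aspects.add("SchemaMetadata");
        }
        // Step 8: a platform instance triggers DataPlatformInstance.
        if (platformInstance != null) {
            aspects.add("DataPlatformInstance");
        }
        return aspects;
    }
}
```

A builder call that sets only platform and name corresponds to passing null for every argument here, which produces an empty set: the entity carries a URN but no cached aspects.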

Usage Examples

Minimal Dataset

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_database.my_schema.my_table")
    .build();
// URN: urn:li:dataset:(urn:li:dataPlatform:snowflake,my_database.my_schema.my_table,PROD)

Full Dataset

Dataset dataset = Dataset.builder()
    .platform("bigquery")
    .name("project.dataset.table")
    .env("DEV")
    .platformInstance("us-east1")
    .description("Customer transactions table")
    .displayName("Customer Transactions")
    .customProperties(Map.of("team", "data-eng", "pii", "true"))
    .build();

From Existing URN

DatasetUrn existingUrn = DatasetUrn.createFromString(
    "urn:li:dataset:(urn:li:dataPlatform:mysql,db.users,PROD)");
Dataset dataset = Dataset.builder()
    .urn(existingUrn)
    .description("Users table")
    .build();
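
To make concrete what "extracts platform, name, and env" means for the URN above, here is a string-level sketch. In the SDK this parsing is handled by DatasetUrn; the DatasetUrnParts class and its parse method are hypothetical and exist only for illustration. The sketch assumes the platform and env segments contain no commas, so it splits on the first and last comma of the key.

```java
// Hypothetical string-level sketch; the SDK's DatasetUrn class does this parsing itself.
final class DatasetUrnParts {
    final String platform;
    final String name;
    final String env;

    private DatasetUrnParts(String platform, String name, String env) {
        this.platform = platform;
        this.name = name;
        this.env = env;
    }

    /** Splits urn:li:dataset:(urn:li:dataPlatform:PLATFORM,NAME,ENV) into its three parts. */
    static DatasetUrnParts parse(String urn) {
        final String prefix = "urn:li:dataset:(";
        final String platformPrefix = "urn:li:dataPlatform:";
        if (urn == null || !urn.startsWith(prefix) || !urn.endsWith(")")) {
            throw new IllegalArgumentException("not a dataset URN: " + urn);
        }
        // Strip the wrapper, leaving "urn:li:dataPlatform:PLATFORM,NAME,ENV".
        String key = urn.substring(prefix.length(), urn.length() - 1);
        int first = key.indexOf(',');
        int last = key.lastIndexOf(',');
        if (first < 0 || first == last || !key.startsWith(platformPrefix)) {
            throw new IllegalArgumentException("malformed dataset key: " + key);
        }
        return new DatasetUrnParts(
                key.substring(platformPrefix.length(), first),
                key.substring(first + 1, last),
                key.substring(last + 1));
    }
}
```

For the URN in the example above, parse returns platform "mysql", name "db.users", and env "PROD", which are the three values the builder would otherwise require through platform(), name(), and env().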

Related

Domains

Data_Integration, Metadata_Management, Java_SDK
