Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Datahub project Datahub DataFlow Entity

From Leeroopedia
Revision as of 14:42, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Datahub_project_Datahub_DataFlow_Entity.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Java_SDK, Metadata_Management
Last Updated 2026-02-10 00:00 GMT

Overview

DataFlow is a Java SDK V2 entity class representing a DataHub DataFlow entity, a data processing pipeline or workflow such as an Airflow DAG, Spark job, or dbt project.

Description

The DataFlow class extends Entity and implements mixin interfaces (HasTags, HasGlossaryTerms, HasOwners, HasDomains, HasSubTypes, HasStructuredProperties) for standard metadata operations.

Key characteristics:

  • Entity type: "dataFlow"
  • URN format: urn:li:dataFlow:(orchestrator,flowId,cluster) via DataFlowUrn
  • Accumulated patch builder: Uses a shared DataFlowInfoPatchBuilder instance (retrieved/created via getDataFlowInfoPatchBuilder()) to accumulate multiple patch operations (description, display name, custom properties, timestamps) into a single MCP for efficient emission.
  • Mixed mutation strategies: Some methods use the accumulated patch builder (e.g., setDescription, setDisplayName, addCustomProperty, setCreated, setLastModified), while others directly mutate and cache the DataFlowInfo aspect (e.g., setExternalUrl, setProject).
  • Builder: Requires orchestrator, flowId, and cluster. Optionally accepts displayName, description, and customProperties to pre-populate the DataFlowInfo aspect.

Default aspects fetched: Ownership, GlobalTags, GlossaryTerms, Domains, Status, InstitutionalMemory, DataFlowInfo, EditableDataFlowProperties.

Usage

Use the DataFlow entity to represent data pipelines and workflows in DataHub. A DataFlow is the parent container for DataJob entities (individual tasks within the pipeline). Construct via its Builder and upsert through EntityClient.upsert(dataflow).

Code Reference

Source Location

Signature

public class DataFlow extends Entity
    implements HasTags<DataFlow>, HasGlossaryTerms<DataFlow>, HasOwners<DataFlow>,
               HasDomains<DataFlow>, HasSubTypes<DataFlow>, HasStructuredProperties<DataFlow> {

    // Factory
    public static Builder builder();

    // Identity
    public String getEntityType();           // returns "dataFlow"
    public DataFlowUrn getDataFlowUrn();
    public DataFlow mutable();

    // Metadata operations
    public DataFlow setDescription(String description);
    public String getDescription();
    public DataFlow setDisplayName(String displayName);
    public String getDisplayName();
    public DataFlow setExternalUrl(String externalUrl);
    public String getExternalUrl();
    public DataFlow setProject(String project);
    public String getProject();
    public DataFlow setCreated(long createdTimeMs);
    public Long getCreated();
    public DataFlow setLastModified(long lastModifiedTimeMs);
    public Long getLastModified();

    // Custom properties
    public DataFlow addCustomProperty(String key, String value);
    public DataFlow removeCustomProperty(String key);
    public DataFlow setCustomProperties(Map<String, String> properties);
}

Import

import datahub.client.v2.entity.DataFlow;

I/O Contract

Builder Inputs

Parameter Type Required Description
orchestrator String Yes Orchestrator name (e.g., "airflow", "dagster", "prefect")
flowId String Yes Flow/DAG identifier
cluster String Yes Cluster name (e.g., "prod", "dev")
displayName String No Human-readable display name
description String No Pipeline description
customProperties Map<String, String> No Custom key-value properties

Outputs

Method Return Type Description
build() DataFlow New DataFlow entity with DataFlowUrn
getDataFlowUrn() DataFlowUrn The dataflow URN containing orchestrator, flowId, cluster

Usage Examples

// Create a dataflow
DataFlow pipeline = DataFlow.builder()
    .orchestrator("airflow")
    .flowId("customer_etl_dag")
    .cluster("prod")
    .displayName("Customer ETL Pipeline")
    .description("Daily ETL for customer data")
    .build();

// Add metadata via fluent API
pipeline.addTag("critical");
pipeline.addOwner("urn:li:corpuser:data_eng", OwnershipType.TECHNICAL_OWNER);
pipeline.setDomain("urn:li:domain:DataEngineering");
pipeline.addCustomProperty("schedule", "0 6 * * *");
pipeline.setCreated(System.currentTimeMillis());

// Upsert to DataHub
client.entities().upsert(pipeline);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment