Implementation:Datahub project Datahub DataFlow Entity
| Knowledge Sources | |
|---|---|
| Domains | Java_SDK, Metadata_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
DataFlow is a Java SDK V2 entity class representing a DataHub DataFlow entity, a data processing pipeline or workflow such as an Airflow DAG, Spark job, or dbt project.
Description
The DataFlow class extends Entity and implements mixin interfaces (HasTags, HasGlossaryTerms, HasOwners, HasDomains, HasSubTypes, HasStructuredProperties) for standard metadata operations.
Key characteristics:
- Entity type:
"dataFlow" - URN format:
urn:li:dataFlow:(orchestrator,flowId,cluster)viaDataFlowUrn - Accumulated patch builder: Uses a shared
DataFlowInfoPatchBuilderinstance (retrieved/created viagetDataFlowInfoPatchBuilder()) to accumulate multiple patch operations (description, display name, custom properties, timestamps) into a single MCP for efficient emission. - Mixed mutation strategies: Some methods use the accumulated patch builder (e.g.,
setDescription,setDisplayName,addCustomProperty,setCreated,setLastModified), while others directly mutate and cache theDataFlowInfoaspect (e.g.,setExternalUrl,setProject). - Builder: Requires
orchestrator,flowId, andcluster. Optionally acceptsdisplayName,description, andcustomPropertiesto pre-populate theDataFlowInfoaspect.
Default aspects fetched: Ownership, GlobalTags, GlossaryTerms, Domains, Status, InstitutionalMemory, DataFlowInfo, EditableDataFlowProperties.
Usage
Use the DataFlow entity to represent data pipelines and workflows in DataHub. A DataFlow is the parent container for DataJob entities (individual tasks within the pipeline). Construct via its Builder and upsert through EntityClient.upsert(dataflow).
Code Reference
Source Location
- Repository: Datahub_project_Datahub
- File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/DataFlow.java
Signature
public class DataFlow extends Entity
implements HasTags<DataFlow>, HasGlossaryTerms<DataFlow>, HasOwners<DataFlow>,
HasDomains<DataFlow>, HasSubTypes<DataFlow>, HasStructuredProperties<DataFlow> {
// Factory
public static Builder builder();
// Identity
public String getEntityType(); // returns "dataFlow"
public DataFlowUrn getDataFlowUrn();
public DataFlow mutable();
// Metadata operations
public DataFlow setDescription(String description);
public String getDescription();
public DataFlow setDisplayName(String displayName);
public String getDisplayName();
public DataFlow setExternalUrl(String externalUrl);
public String getExternalUrl();
public DataFlow setProject(String project);
public String getProject();
public DataFlow setCreated(long createdTimeMs);
public Long getCreated();
public DataFlow setLastModified(long lastModifiedTimeMs);
public Long getLastModified();
// Custom properties
public DataFlow addCustomProperty(String key, String value);
public DataFlow removeCustomProperty(String key);
public DataFlow setCustomProperties(Map<String, String> properties);
}
Import
import datahub.client.v2.entity.DataFlow;
I/O Contract
Builder Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
orchestrator |
String |
Yes | Orchestrator name (e.g., "airflow", "dagster", "prefect") |
flowId |
String |
Yes | Flow/DAG identifier |
cluster |
String |
Yes | Cluster name (e.g., "prod", "dev") |
displayName |
String |
No | Human-readable display name |
description |
String |
No | Pipeline description |
customProperties |
Map<String, String> |
No | Custom key-value properties |
Outputs
| Method | Return Type | Description |
|---|---|---|
build() |
DataFlow |
New DataFlow entity with DataFlowUrn
|
getDataFlowUrn() |
DataFlowUrn |
The dataflow URN containing orchestrator, flowId, cluster |
Usage Examples
// Create a dataflow
DataFlow pipeline = DataFlow.builder()
.orchestrator("airflow")
.flowId("customer_etl_dag")
.cluster("prod")
.displayName("Customer ETL Pipeline")
.description("Daily ETL for customer data")
.build();
// Add metadata via fluent API
pipeline.addTag("critical");
pipeline.addOwner("urn:li:corpuser:data_eng", OwnershipType.TECHNICAL_OWNER);
pipeline.setDomain("urn:li:domain:DataEngineering");
pipeline.addCustomProperty("schedule", "0 6 * * *");
pipeline.setCreated(System.currentTimeMillis());
// Upsert to DataHub
client.entities().upsert(pipeline);