Implementation:Datahub project Datahub DataJob Entity
| Knowledge Sources | |
|---|---|
| Domains | Java_SDK, Metadata_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
DataJob is a Java SDK V2 entity class representing a DataHub DataJob entity: a unit of work in a data processing pipeline, such as an Airflow task, dbt model, or Spark job. It supports lineage at the dataset, job, and field level, including fine-grained column-to-column lineage.
Description
The DataJob class extends Entity and implements mixin interfaces (HasTags, HasGlossaryTerms, HasOwners, HasDomains, HasSubTypes, HasStructuredProperties). At 1214 lines, it is the most feature-rich entity class in the SDK V2, providing extensive lineage operations.
Key characteristics:
- Entity type: "dataJob"
- URN format: urn:li:dataJob:(urn:li:dataFlow:(orchestrator,flowId,cluster),jobId) via DataJobUrn
- Lineage operations:
  - Input datasets: addInputDataset(), removeInputDataset(), setInputDatasets(), getInputDatasets() - track datasets this job reads from
  - Output datasets: addOutputDataset(), removeOutputDataset(), setOutputDatasets(), getOutputDatasets() - track datasets this job writes to
  - Input data jobs: addInputDataJob(), removeInputDataJob(), setInputDataJobs(), getInputDataJobs() - track upstream job dependencies
  - Input/output fields: addInputField(), removeInputField(), addOutputField(), removeOutputField() - track schema field (column) level I/O
  - Fine-grained lineage: addFineGrainedLineage(), removeFineGrainedLineage(), getFineGrainedLineages() - track column-to-column transformations with transformation operation type, confidence score, and optional query URN
- Patch builders: Uses DataJobInfoPatchBuilder for description, name, and custom properties. Uses DataJobInputOutputPatchBuilder for all lineage operations. Each lineage mutation creates a separate patch MCP.
- Builder: Requires orchestrator, flowId, and jobId. Cluster defaults to "prod". If creating a DataJobInfo aspect, both name and type are required.
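The URN format listed above nests the parent flow's URN inside the job's URN. A minimal sketch of that string layout in plain Java follows. The class and method names (DataJobUrnSketch, dataFlowUrn, dataJobUrn) are hypothetical helpers written for illustration; they are not part of the SDK, which handles this through DataJobUrn and the Builder.

```java
/** Illustrative only: hypothetical helpers, not part of the DataHub SDK. */
final class DataJobUrnSketch {

    private DataJobUrnSketch() {}

    /** Builds the parent flow URN: urn:li:dataFlow:(orchestrator,flowId,cluster). */
    static String dataFlowUrn(String orchestrator, String flowId, String cluster) {
        return "urn:li:dataFlow:(" + orchestrator + "," + flowId + "," + cluster + ")";
    }

    /** Builds the job URN by nesting the flow URN: urn:li:dataJob:(flowUrn,jobId). */
    static String dataJobUrn(String orchestrator, String flowId, String cluster, String jobId) {
        return "urn:li:dataJob:(" + dataFlowUrn(orchestrator, flowId, cluster) + "," + jobId + ")";
    }

    public static void main(String[] args) {
        // Prints urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_etl,prod),transform_customers)
        System.out.println(dataJobUrn("airflow", "customer_etl", "prod", "transform_customers"));
    }
}
```

Because the flow URN is embedded, two jobs with the same jobId under different flows or clusters get distinct URNs.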
Default aspects fetched: Ownership, GlobalTags, GlossaryTerms, Domains, Status, InstitutionalMemory, DataJobInfo, EditableDataJobProperties.
Usage
Use the DataJob entity to represent individual tasks within a data pipeline and to establish lineage relationships between datasets, jobs, and columns. It is particularly well suited to building fine-grained (column-level) lineage graphs. Construct a DataJob via its Builder and upsert it through EntityClient.upsert(dataJob).
Code Reference
Source Location
- Repository: Datahub_project_Datahub
- File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/DataJob.java
Signature
public class DataJob extends Entity
implements HasTags<DataJob>, HasGlossaryTerms<DataJob>, HasOwners<DataJob>,
HasDomains<DataJob>, HasSubTypes<DataJob>, HasStructuredProperties<DataJob> {
// Factory
public static Builder builder();
// Identity
public String getEntityType(); // returns "dataJob"
public DataJobUrn getDataJobUrn();
public DataJob mutable();
// Metadata
public DataJob setDescription(String description);
public String getDescription();
public DataJob setName(String name);
public String getName();
// Input datasets
public DataJob addInputDataset(String datasetUrn);
public DataJob addInputDataset(DatasetUrn datasetUrn);
public DataJob setInputDatasets(List<String> datasetUrns);
public DataJob removeInputDataset(String datasetUrn);
public DataJob removeInputDataset(DatasetUrn datasetUrn);
public List<DatasetUrn> getInputDatasets();
// Output datasets
public DataJob addOutputDataset(String datasetUrn);
public DataJob addOutputDataset(DatasetUrn datasetUrn);
public DataJob setOutputDatasets(List<String> datasetUrns);
public DataJob removeOutputDataset(String datasetUrn);
public DataJob removeOutputDataset(DatasetUrn datasetUrn);
public List<DatasetUrn> getOutputDatasets();
// Input data job dependencies
public DataJob addInputDataJob(String dataJobUrn);
public DataJob addInputDataJob(DataJobUrn dataJobUrn);
public DataJob setInputDataJobs(List<String> dataJobUrns);
public DataJob removeInputDataJob(String dataJobUrn);
public DataJob removeInputDataJob(DataJobUrn dataJobUrn);
public List<DataJobUrn> getInputDataJobs();
// Field-level I/O
public DataJob addInputField(String fieldUrn);
public DataJob addInputField(Urn fieldUrn);
public DataJob removeInputField(String fieldUrn);
public DataJob removeInputField(Urn fieldUrn);
public List<Urn> getInputFields();
public DataJob addOutputField(String fieldUrn);
public DataJob addOutputField(Urn fieldUrn);
public DataJob removeOutputField(String fieldUrn);
public DataJob removeOutputField(Urn fieldUrn);
public List<Urn> getOutputFields();
// Fine-grained (column-level) lineage
public DataJob addFineGrainedLineage(String upstream, String downstream, String operation, Float confidence);
public DataJob addFineGrainedLineage(Urn upstream, Urn downstream, String operation, Float confidence, Urn queryUrn);
public DataJob addFineGrainedLineage(String upstream, String downstream, String operation);
public DataJob removeFineGrainedLineage(String upstream, String downstream, String operation);
public DataJob removeFineGrainedLineage(Urn upstream, Urn downstream, String operation, Urn queryUrn);
public List<FineGrainedLineage> getFineGrainedLineages();
// Custom properties
public DataJob addCustomProperty(String key, String value);
public DataJob removeCustomProperty(String key);
public DataJob setCustomProperties(Map<String, String> properties);
}
Import
import datahub.client.v2.entity.DataJob;
I/O Contract
Builder Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| orchestrator | String | Yes | Orchestrator name (e.g., "airflow", "dagster") |
| flowId | String | Yes | Parent flow/DAG ID |
| jobId | String | Yes | Task/job identifier |
| cluster | String | No | Cluster name, defaults to "prod" |
| name | String | Conditional | Display name (required if creating DataJobInfo) |
| type | String | Conditional | Job type, e.g., "BATCH", "STREAMING" (required if creating DataJobInfo) |
| description | String | No | Job description |
| customProperties | Map<String, String> | No | Custom key-value properties |
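The required, defaulted, and conditional inputs in the table can be modelled as a small validation sketch. This is a toy stand-in and not the SDK's Builder: the class DataJobSpecSketch, the exception types, and the assumption that the DataJobInfo aspect is created whenever either name or type is supplied are all illustrative choices, not documented SDK behavior.

```java
import java.util.Objects;

/** Illustrative only: a toy model of the documented builder rules, not the SDK's Builder. */
final class DataJobSpecSketch {
    final String orchestrator;
    final String flowId;
    final String jobId;
    final String cluster;
    final String name; // null when no DataJobInfo is requested
    final String type; // null when no DataJobInfo is requested

    private DataJobSpecSketch(Builder b) {
        this.orchestrator = b.orchestrator;
        this.flowId = b.flowId;
        this.jobId = b.jobId;
        this.cluster = b.cluster;
        this.name = b.name;
        this.type = b.type;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String orchestrator;
        private String flowId;
        private String jobId;
        private String cluster = "prod"; // documented default
        private String name;
        private String type;

        Builder orchestrator(String v) { this.orchestrator = v; return this; }
        Builder flowId(String v) { this.flowId = v; return this; }
        Builder jobId(String v) { this.jobId = v; return this; }
        Builder cluster(String v) { this.cluster = v; return this; }
        Builder name(String v) { this.name = v; return this; }
        Builder type(String v) { this.type = v; return this; }

        DataJobSpecSketch build() {
            // Required inputs (exception type is an assumption of this sketch).
            Objects.requireNonNull(orchestrator, "orchestrator is required");
            Objects.requireNonNull(flowId, "flowId is required");
            Objects.requireNonNull(jobId, "jobId is required");
            // Conditional inputs: name and type must be supplied together.
            if ((name == null) != (type == null)) {
                throw new IllegalArgumentException("name and type must be provided together");
            }
            return new DataJobSpecSketch(this);
        }
    }
}
```

In this sketch, calling build() with only orchestrator, flowId, and jobId succeeds and leaves cluster at "prod", while supplying name without type fails.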
Fine-Grained Lineage Parameters
| Parameter | Type | Description |
|---|---|---|
| upstreamField | String or Urn | Schema field URN: urn:li:schemaField:(DATASET_URN,COLUMN_NAME) |
| downstreamField | String or Urn | Target schema field URN |
| transformationOperation | String | Operation type: "TRANSFORM", "IDENTITY", "AGGREGATION" |
| confidenceScore | Float | Confidence between 0.0 and 1.0 (defaults to 1.0) |
| queryUrn | Urn | Optional query URN this lineage was derived from |
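As a rough illustration of two of these parameters, the sketch below composes a schema field URN from a dataset URN and column name, and resolves a confidence score with the documented default of 1.0. The helper names are hypothetical and not SDK API. Rejecting out-of-range scores is an assumption of this sketch; this page does not say how the SDK treats them.

```java
/** Illustrative only: hypothetical helpers, not part of the DataHub SDK. */
final class FineGrainedLineageSketch {

    static final float DEFAULT_CONFIDENCE = 1.0f; // documented default

    private FineGrainedLineageSketch() {}

    /** Builds urn:li:schemaField:(DATASET_URN,COLUMN_NAME). */
    static String schemaFieldUrn(String datasetUrn, String columnName) {
        return "urn:li:schemaField:(" + datasetUrn + "," + columnName + ")";
    }

    /** Returns the default when no score is given; rejects values outside [0.0, 1.0]. */
    static float resolveConfidence(Float confidence) {
        if (confidence == null) {
            return DEFAULT_CONFIDENCE;
        }
        if (Float.isNaN(confidence) || confidence < 0.0f || confidence > 1.0f) {
            throw new IllegalArgumentException("confidence must be between 0.0 and 1.0");
        }
        return confidence;
    }

    public static void main(String[] args) {
        String dataset = "urn:li:dataset:(urn:li:dataPlatform:mysql,raw.customers,PROD)";
        System.out.println(schemaFieldUrn(dataset, "full_name"));
        System.out.println(resolveConfidence(null));
    }
}
```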
Usage Examples
// Create a data job
DataJob task = DataJob.builder()
.orchestrator("airflow")
.flowId("customer_etl")
.jobId("transform_customers")
.cluster("prod")
.name("Transform Customers")
.type("BATCH")
.description("Transforms raw customer data")
.build();
// Define lineage
task.addInputDataset("urn:li:dataset:(urn:li:dataPlatform:mysql,raw.customers,PROD)");
task.addOutputDataset("urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customers,PROD)");
task.addInputDataJob("urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_etl,prod),extract_task)");
// Fine-grained column-level lineage
task.addFineGrainedLineage(
"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mysql,raw.customers,PROD),full_name)",
"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customers,PROD),name)",
"TRANSFORM"
);
// Add metadata
task.addTag("critical");
task.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER);
// Upsert to DataHub (client is assumed to be an already-initialized SDK V2 client)
client.entities().upsert(task);