Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Datahub project Datahub DataJob Entity

From Leeroopedia


Knowledge Sources
Domains Java_SDK, Metadata_Management
Last Updated 2026-02-10 00:00 GMT

Overview

DataJob is a Java SDK V2 entity class representing a DataHub DataJob entity, a unit of work in a data processing pipeline such as an Airflow task, dbt model, or Spark job, with comprehensive lineage support including fine-grained column-level lineage.

Description

The DataJob class extends Entity and implements mixin interfaces (HasTags, HasGlossaryTerms, HasOwners, HasDomains, HasSubTypes, HasStructuredProperties). At 1214 lines, it is the most feature-rich entity class in the SDK V2, providing extensive lineage operations.

Key characteristics:

  • Entity type: "dataJob"
  • URN format: urn:li:dataJob:(urn:li:dataFlow:(orchestrator,flowId,cluster),jobId) via DataJobUrn
  • Lineage operations:
    • Input datasets: addInputDataset(), removeInputDataset(), setInputDatasets(), getInputDatasets() - track datasets this job reads from
    • Output datasets: addOutputDataset(), removeOutputDataset(), setOutputDatasets(), getOutputDatasets() - track datasets this job writes to
    • Input data jobs: addInputDataJob(), removeInputDataJob(), setInputDataJobs(), getInputDataJobs() - track upstream job dependencies
    • Input/output fields: addInputField(), removeInputField(), addOutputField(), removeOutputField() - track schema field (column) level I/O
    • Fine-grained lineage: addFineGrainedLineage(), removeFineGrainedLineage(), getFineGrainedLineages() - track column-to-column transformations with transformation operation type, confidence score, and optional query URN
  • Patch builders: Uses DataJobInfoPatchBuilder for description, name, and custom properties. Uses DataJobInputOutputPatchBuilder for all lineage operations. Each lineage mutation creates a separate patch MCP.
  • Builder: Requires orchestrator, flowId, and jobId. Cluster defaults to "prod". If creating a DataJobInfo aspect, both name and type are required.

Default aspects fetched: Ownership, GlobalTags, GlossaryTerms, Domains, Status, InstitutionalMemory, DataJobInfo, EditableDataJobProperties.

Usage

Use the DataJob entity to represent individual tasks within a data pipeline and to establish lineage relationships between datasets, jobs, and columns. It is particularly powerful for building fine-grained lineage graphs. Construct via its Builder and upsert through EntityClient.upsert(dataJob).

Code Reference

Source Location

Signature

public class DataJob extends Entity
    implements HasTags<DataJob>, HasGlossaryTerms<DataJob>, HasOwners<DataJob>,
               HasDomains<DataJob>, HasSubTypes<DataJob>, HasStructuredProperties<DataJob> {

    // Factory
    public static Builder builder();

    // Identity
    public String getEntityType();           // returns "dataJob"
    public DataJobUrn getDataJobUrn();
    public DataJob mutable();

    // Metadata
    public DataJob setDescription(String description);
    public String getDescription();
    public DataJob setName(String name);
    public String getName();

    // Input datasets
    public DataJob addInputDataset(String datasetUrn);
    public DataJob addInputDataset(DatasetUrn datasetUrn);
    public DataJob setInputDatasets(List<String> datasetUrns);
    public DataJob removeInputDataset(String datasetUrn);
    public DataJob removeInputDataset(DatasetUrn datasetUrn);
    public List<DatasetUrn> getInputDatasets();

    // Output datasets
    public DataJob addOutputDataset(String datasetUrn);
    public DataJob addOutputDataset(DatasetUrn datasetUrn);
    public DataJob setOutputDatasets(List<String> datasetUrns);
    public DataJob removeOutputDataset(String datasetUrn);
    public DataJob removeOutputDataset(DatasetUrn datasetUrn);
    public List<DatasetUrn> getOutputDatasets();

    // Input data job dependencies
    public DataJob addInputDataJob(String dataJobUrn);
    public DataJob addInputDataJob(DataJobUrn dataJobUrn);
    public DataJob setInputDataJobs(List<String> dataJobUrns);
    public DataJob removeInputDataJob(String dataJobUrn);
    public DataJob removeInputDataJob(DataJobUrn dataJobUrn);
    public List<DataJobUrn> getInputDataJobs();

    // Field-level I/O
    public DataJob addInputField(String fieldUrn);
    public DataJob addInputField(Urn fieldUrn);
    public DataJob removeInputField(String fieldUrn);
    public DataJob removeInputField(Urn fieldUrn);
    public List<Urn> getInputFields();
    public DataJob addOutputField(String fieldUrn);
    public DataJob addOutputField(Urn fieldUrn);
    public DataJob removeOutputField(String fieldUrn);
    public DataJob removeOutputField(Urn fieldUrn);
    public List<Urn> getOutputFields();

    // Fine-grained (column-level) lineage
    public DataJob addFineGrainedLineage(String upstream, String downstream, String operation, Float confidence);
    public DataJob addFineGrainedLineage(Urn upstream, Urn downstream, String operation, Float confidence, Urn queryUrn);
    public DataJob addFineGrainedLineage(String upstream, String downstream, String operation);
    public DataJob removeFineGrainedLineage(String upstream, String downstream, String operation);
    public DataJob removeFineGrainedLineage(Urn upstream, Urn downstream, String operation, Urn queryUrn);
    public List<FineGrainedLineage> getFineGrainedLineages();

    // Custom properties
    public DataJob addCustomProperty(String key, String value);
    public DataJob removeCustomProperty(String key);
    public DataJob setCustomProperties(Map<String, String> properties);
}

Import

import datahub.client.v2.entity.DataJob;

I/O Contract

Builder Inputs

Parameter Type Required Description
orchestrator String Yes Orchestrator name (e.g., "airflow", "dagster")
flowId String Yes Parent flow/DAG ID
jobId String Yes Task/job identifier
cluster String No Cluster name, defaults to "prod"
name String Conditional Display name (required if creating DataJobInfo)
type String Conditional Job type e.g., "BATCH", "STREAMING" (required if creating DataJobInfo)
description String No Job description
customProperties Map<String, String> No Custom key-value properties

Fine-Grained Lineage Parameters

Parameter Type Description
upstreamField String or Urn Schema field URN: urn:li:schemaField:(DATASET_URN,COLUMN_NAME)
downstreamField String or Urn Target schema field URN
transformationOperation String Operation type: "TRANSFORM", "IDENTITY", "AGGREGATION"
confidenceScore Float Confidence between 0.0 and 1.0 (defaults to 1.0)
queryUrn Urn Optional query URN this lineage was derived from

Usage Examples

// Create a data job
DataJob task = DataJob.builder()
    .orchestrator("airflow")
    .flowId("customer_etl")
    .jobId("transform_customers")
    .cluster("prod")
    .name("Transform Customers")
    .type("BATCH")
    .description("Transforms raw customer data")
    .build();

// Define lineage
task.addInputDataset("urn:li:dataset:(urn:li:dataPlatform:mysql,raw.customers,PROD)");
task.addOutputDataset("urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customers,PROD)");
task.addInputDataJob("urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_etl,prod),extract_task)");

// Fine-grained column-level lineage
task.addFineGrainedLineage(
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mysql,raw.customers,PROD),full_name)",
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.customers,PROD),name)",
    "TRANSFORM"
);

// Add metadata
task.addTag("critical");
task.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER);

// Upsert to DataHub
client.entities().upsert(task);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment