
Principle:Datahub project Datahub Agent Deployment

From Leeroopedia


Page Type: Principle
Workflow: Spark_Lineage_Capture
Pair: 1 of 6
Implementation: Implementation:Datahub_project_Datahub_Spark_Submit_Agent_JAR
Repository: https://github.com/datahub-project/datahub
Last Updated: 2026-02-09 17:00 GMT

Overview

Description

Agent Deployment is the principle of deploying instrumentation agents alongside distributed processing engines to capture runtime metadata without modifying application code. In the context of Apache Spark and DataHub, this principle manifests as a lightweight Java agent JAR that is loaded into the Spark driver classpath at launch time. The agent attaches itself to the Spark runtime through the spark.extraListeners mechanism, enabling transparent interception of job lifecycle events and the extraction of lineage metadata.
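As a concrete sketch of this launch-time attachment, the snippet below assembles a spark-submit invocation that pulls the agent JAR onto the driver classpath and registers it through spark.extraListeners. The Maven coordinates and listener class name are illustrative assumptions, not confirmed values from this page; substitute the ones documented for your DataHub release.

```python
# Sketch: build a spark-submit invocation that loads a lineage agent JAR
# via Spark's extraListeners extension point. The package coordinates and
# listener class name below are illustrative placeholders -- check the
# DataHub Spark integration docs for the exact values.

AGENT_PACKAGE = "io.acryl:acryl-spark-lineage:<version>"  # assumed coordinates
LISTENER_CLASS = "datahub.spark.DatahubSparkListener"     # assumed class name

def build_spark_submit(app_jar: str, main_class: str) -> list[str]:
    """Assemble spark-submit argv with the agent attached at launch time."""
    return [
        "spark-submit",
        "--packages", AGENT_PACKAGE,                        # agent onto driver classpath
        "--conf", f"spark.extraListeners={LISTENER_CLASS}", # register listener with runtime
        "--class", main_class,
        app_jar,
    ]

cmd = build_spark_submit("my-job.jar", "com.example.MyJob")
print(" ".join(cmd))
```

Note that the application JAR and its main class are untouched: the agent rides along purely through launcher flags.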

This approach decouples metadata collection from application logic: the Spark job author does not need to write any instrumentation code. Instead, the agent is deployed as a sidecar artifact that participates in the Spark execution environment through well-defined extension points.

Usage

The Agent Deployment principle applies whenever an organization needs to capture lineage metadata from Spark jobs running in any environment: local development, on-premise clusters, Amazon EMR, Databricks, or other cloud-managed Spark services. By deploying the agent JAR via spark-submit --packages or the spark.jars.packages configuration property, teams can instrument entire fleets of Spark applications with zero code changes.
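For fleet-wide instrumentation, the same deployment is usually expressed declaratively, e.g. in spark-defaults.conf or a cluster bootstrap script, so every job picks up the agent automatically. A minimal sketch follows; the spark.datahub.* sink property name, coordinates, and class name are assumptions to be checked against your DataHub version.

```python
# Sketch: agent deployment as Spark configuration properties, suitable for
# spark-defaults.conf on a cluster. The spark.datahub.* key, coordinates,
# and listener class are assumptions, not confirmed values.

conf = {
    "spark.jars.packages": "io.acryl:acryl-spark-lineage:<version>",  # assumed coordinates
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",     # assumed class name
    "spark.datahub.rest.server": "http://datahub-gms:8080",           # assumed sink property
}

def to_spark_defaults(conf: dict) -> str:
    """Render properties in spark-defaults.conf key/value format."""
    return "\n".join(f"{k}  {v}" for k, v in sorted(conf.items()))

print(to_spark_defaults(conf))
```

Placing these keys in cluster-level defaults is what makes the instrumentation consistent across EMR, Databricks, and standalone deployments: individual jobs inherit the agent without opting in.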

This principle is essential when:

  • Multiple teams author Spark jobs independently and a centralized lineage solution is required.
  • The organization cannot modify existing Spark application source code.
  • Metadata collection must be consistent across heterogeneous Spark deployment environments (EMR, Databricks, standalone clusters).

Theoretical Basis

The Agent Deployment principle draws from the agent-based instrumentation pattern widely used in application performance monitoring (APM) and observability systems. In this pattern, a software agent is injected into the runtime environment of a target application to observe its behavior without altering its source code. The key characteristics are:

Non-invasive attachment: The agent leverages the host runtime's extension mechanisms rather than requiring bytecode modification or source-level changes. In Spark's case, the SparkListener API provides a first-class hook for external observers.

Sidecar deployment model: The agent is packaged as a standalone artifact (a shadow JAR in this case) that is deployed alongside the application. This mirrors the sidecar pattern from microservices architecture, where auxiliary functionality is co-deployed with the primary service without tight coupling.

Plugin lifecycle management: The agent participates in the host application's lifecycle through defined callbacks. The Spark SparkListener interface provides hooks for application start/end and job start/end events, giving the agent natural integration points for metadata capture.
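The callback pattern described above can be sketched in plain Python. This is a language-neutral stand-in for the shape of Spark's JVM-side SparkListener hooks (application start/end, job start/end), not the actual Spark API; names and payloads are illustrative.

```python
# Sketch of the plugin-lifecycle pattern: the host runtime owns the
# lifecycle and drives the agent through defined callbacks. Plain Python
# mimicking the shape of Spark's SparkListener hooks, not the real API.

class LineageAgent:
    """Observes lifecycle events and accumulates metadata non-invasively."""

    def __init__(self):
        self.events = []

    def on_application_start(self, app_name):
        self.events.append(("app_start", app_name))

    def on_job_end(self, job_id, outputs):
        # In a real agent this is where the query plan would be inspected
        # and lineage edges emitted to the metadata service.
        self.events.append(("job_end", job_id, tuple(outputs)))

class HostRuntime:
    """Stand-in for the engine: it calls back into registered listeners."""

    def __init__(self, listeners):
        self.listeners = listeners

    def run(self, app_name, jobs):
        for l in self.listeners:
            l.on_application_start(app_name)
        for job_id, outputs in jobs:
            for l in self.listeners:
                l.on_job_end(job_id, outputs)

agent = LineageAgent()
HostRuntime([agent]).run("etl", [(0, ["s3://bucket/out"])])
print(agent.events)
```

The application code (the jobs themselves) never references the agent; the runtime's listener registry is the only coupling point, which is exactly what makes the attachment non-invasive.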

Shadow JAR packaging: To avoid classpath conflicts with the host Spark application, the agent uses a shadow (uber) JAR that relocates its dependencies into unique package namespaces. This ensures that the agent's transitive dependencies do not interfere with the application's own library versions.
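To make the relocation idea concrete: build tools such as Gradle Shadow or Maven Shade rewrite dependency class names in bytecode at packaging time. The toy function below shows only the name mapping involved; the shaded prefix and the relocated packages are illustrative, not taken from the actual DataHub build.

```python
# Toy illustration of shadow-JAR relocation: dependency classes are moved
# under an agent-private namespace so they cannot collide with the host
# application's library versions. Real tools rewrite bytecode; this shows
# just the name mapping. Prefix and packages are illustrative.

RELOCATIONS = {
    "com.fasterxml.jackson": "datahub.shaded.com.fasterxml.jackson",
    "org.apache.http": "datahub.shaded.org.apache.http",
}

def relocate(class_name: str) -> str:
    """Apply the first matching relocation rule to a fully qualified name."""
    for src, dst in RELOCATIONS.items():
        if class_name.startswith(src):
            return dst + class_name[len(src):]
    return class_name  # untouched: not a shaded dependency

print(relocate("com.fasterxml.jackson.databind.ObjectMapper"))
```

After relocation, the agent's copy of a library loads under the shaded name, so the host application can use a different version of the same library without classpath conflicts.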

The theoretical advantage of agent-based instrumentation over alternatives (such as log parsing, query plan analysis after the fact, or manual lineage annotation) is that it captures metadata at runtime with full fidelity to the actual execution, including dynamic query plans and runtime-resolved dataset paths.
