
Principle:Datahub project Datahub Agent Deployment

From Leeroopedia


Page Type: Principle
Workflow: Spark_Lineage_Capture
Pair: 1 of 6
Implementation: Implementation:Datahub_project_Datahub_Spark_Submit_Agent_JAR
Repository: https://github.com/datahub-project/datahub
Last Updated: 2026-02-09 17:00 GMT

Overview

Description

Agent Deployment is the principle of deploying instrumentation agents alongside distributed processing engines to capture runtime metadata without modifying application code. In the context of Apache Spark and DataHub, this principle manifests as a lightweight Java agent JAR that is loaded into the Spark driver classpath at launch time. The agent attaches itself to the Spark runtime through the spark.extraListeners mechanism, enabling transparent interception of job lifecycle events and the extraction of lineage metadata.
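As a concrete sketch of this launch-time attachment, the snippet below assembles a spark-submit invocation that pulls the agent JAR onto the driver classpath and registers it through spark.extraListeners. The Maven coordinates and listener class name are illustrative assumptions, not confirmed values from this page; substitute the ones documented for your DataHub release.

```python
# Sketch: build a spark-submit invocation that loads a lineage agent JAR
# via Spark's extraListeners extension point. The package coordinates and
# listener class name below are illustrative placeholders -- check the
# DataHub Spark integration docs for the exact values.

AGENT_PACKAGE = "io.acryl:acryl-spark-lineage:<version>"  # assumed coordinates
LISTENER_CLASS = "datahub.spark.DatahubSparkListener"     # assumed class name

def build_spark_submit(app_jar: str, main_class: str) -> list[str]:
    """Assemble spark-submit argv with the agent attached at launch time."""
    return [
        "spark-submit",
        "--packages", AGENT_PACKAGE,                        # agent onto driver classpath
        "--conf", f"spark.extraListeners={LISTENER_CLASS}", # register listener with runtime
        "--class", main_class,
        app_jar,
    ]

cmd = build_spark_submit("my-job.jar", "com.example.MyJob")
print(" ".join(cmd))
```

Note that the application JAR and its main class are untouched: the agent rides along purely through launcher flags.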

This approach decouples metadata collection from application logic: the Spark job author does not need to write any instrumentation code. Instead, the agent is deployed as a sidecar artifact that participates in the Spark execution environment through well-defined extension points.

Usage

The Agent Deployment principle applies whenever an organization needs to capture lineage metadata from Spark jobs running in any environment: local development, on-premise clusters, Amazon EMR, Databricks, or other cloud-managed Spark services. By deploying the agent JAR via spark-submit --packages or the spark.jars.packages configuration property, teams can instrument entire fleets of Spark applications with zero code changes.
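For fleet-wide instrumentation, the same deployment is usually expressed declaratively, e.g. in spark-defaults.conf or a cluster bootstrap script, so every job picks up the agent automatically. A minimal sketch follows; the spark.datahub.* sink property name, coordinates, and class name are assumptions to be checked against your DataHub version.

```python
# Sketch: agent deployment as Spark configuration properties, suitable for
# spark-defaults.conf on a cluster. The spark.datahub.* key, coordinates,
# and listener class are assumptions, not confirmed values.

conf = {
    "spark.jars.packages": "io.acryl:acryl-spark-lineage:<version>",  # assumed coordinates
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",     # assumed class name
    "spark.datahub.rest.server": "http://datahub-gms:8080",           # assumed sink property
}

def to_spark_defaults(conf: dict) -> str:
    """Render properties in spark-defaults.conf key/value format."""
    return "\n".join(f"{k}  {v}" for k, v in sorted(conf.items()))

print(to_spark_defaults(conf))
```

Placing these keys in cluster-level defaults is what makes the instrumentation consistent across EMR, Databricks, and standalone deployments: individual jobs inherit the agent without opting in.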

This principle is essential when:

  • Multiple teams author Spark jobs independently and a centralized lineage solution is required.
  • The organization cannot modify existing Spark application source code.
  • Metadata collection must be consistent across heterogeneous Spark deployment environments (EMR, Databricks, standalone clusters).

Theoretical Basis

The Agent Deployment principle draws from the agent-based instrumentation pattern widely used in application performance monitoring (APM) and observability systems. In this pattern, a software agent is injected into the runtime environment of a target application to observe its behavior without altering its source code. The key characteristics are:

Non-invasive attachment: The agent leverages the host runtime's extension mechanisms rather than requiring bytecode modification or source-level changes. In Spark's case, the SparkListener API provides a first-class hook for external observers.

Sidecar deployment model: The agent is packaged as a standalone artifact (a shadow JAR in this case) that is deployed alongside the application. This mirrors the sidecar pattern from microservices architecture, where auxiliary functionality is co-deployed with the primary service without tight coupling.

Plugin lifecycle management: The agent participates in the host application's lifecycle through defined callbacks. The Spark SparkListener interface provides hooks for application start/end and job start/end events, giving the agent natural integration points for metadata capture.
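The callback pattern described above can be sketched in plain Python. This is a language-neutral stand-in for the shape of Spark's JVM-side SparkListener hooks (application start/end, job start/end), not the actual Spark API; names and payloads are illustrative.

```python
# Sketch of the plugin-lifecycle pattern: the host runtime owns the
# lifecycle and drives the agent through defined callbacks. Plain Python
# mimicking the shape of Spark's SparkListener hooks, not the real API.

class LineageAgent:
    """Observes lifecycle events and accumulates metadata non-invasively."""

    def __init__(self):
        self.events = []

    def on_application_start(self, app_name):
        self.events.append(("app_start", app_name))

    def on_job_end(self, job_id, outputs):
        # In a real agent this is where the query plan would be inspected
        # and lineage edges emitted to the metadata service.
        self.events.append(("job_end", job_id, tuple(outputs)))

class HostRuntime:
    """Stand-in for the engine: it calls back into registered listeners."""

    def __init__(self, listeners):
        self.listeners = listeners

    def run(self, app_name, jobs):
        for l in self.listeners:
            l.on_application_start(app_name)
        for job_id, outputs in jobs:
            for l in self.listeners:
                l.on_job_end(job_id, outputs)

agent = LineageAgent()
HostRuntime([agent]).run("etl", [(0, ["s3://bucket/out"])])
print(agent.events)
```

The application code (the jobs themselves) never references the agent; the runtime's listener registry is the only coupling point, which is exactly what makes the attachment non-invasive.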

Shadow JAR packaging: To avoid classpath conflicts with the host Spark application, the agent uses a shadow (uber) JAR that relocates its dependencies into unique package namespaces. This ensures that the agent's transitive dependencies do not interfere with the application's own library versions.
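To make the relocation idea concrete: build tools such as Gradle Shadow or Maven Shade rewrite dependency class names in bytecode at packaging time. The toy function below shows only the name mapping involved; the shaded prefix and the relocated packages are illustrative, not taken from the actual DataHub build.

```python
# Toy illustration of shadow-JAR relocation: dependency classes are moved
# under an agent-private namespace so they cannot collide with the host
# application's library versions. Real tools rewrite bytecode; this shows
# just the name mapping. Prefix and packages are illustrative.

RELOCATIONS = {
    "com.fasterxml.jackson": "datahub.shaded.com.fasterxml.jackson",
    "org.apache.http": "datahub.shaded.org.apache.http",
}

def relocate(class_name: str) -> str:
    """Apply the first matching relocation rule to a fully qualified name."""
    for src, dst in RELOCATIONS.items():
        if class_name.startswith(src):
            return dst + class_name[len(src):]
    return class_name  # untouched: not a shaded dependency

print(relocate("com.fasterxml.jackson.databind.ObjectMapper"))
```

After relocation, the agent's copy of a library loads under the shaded name, so the host application can use a different version of the same library without classpath conflicts.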

The theoretical advantage of agent-based instrumentation over alternatives (such as log parsing, query plan analysis after the fact, or manual lineage annotation) is that it captures metadata at runtime with full fidelity to the actual execution, including dynamic query plans and runtime-resolved dataset paths.
