Principle:Datahub project Datahub Actions Deployment

Metadata

Field	Value
Principle ID	P-DHACT-005
Title	Actions Deployment
Category	Event-Driven Automation
Status	Active
Last Updated	2026-02-10
Repository	Datahub_project_Datahub
Knowledge Sources	GitHub - datahub-project/datahub, DataHub Documentation
Domains	Event_Processing, Automation, Metadata_Management

Overview

Actions deployment is the process of deploying and running event-driven action pipelines as long-running consumers of DataHub metadata events. It launches one or more pipeline configurations as daemon threads that consume from Kafka, with lifecycle management, retry logic, and offset tracking.

Description

Actions deployment involves launching the datahub-actions CLI with one or more YAML configuration files. The deployment process follows these stages:

Pipeline Lifecycle

Configuration Loading: Each YAML config file is loaded with environment variable expansion via load_config_file(). Disabled pipelines (enabled: false) are skipped. Invalid configs are logged and skipped if multiple configs are provided, or cause an error if only one config is specified.
Pipeline Creation: Pipeline.create(config_dict) validates the configuration, instantiates the event source (Kafka consumer), creates the filter and transform chain, and creates the action plugin.
Thread Management: Each pipeline is started in its own daemon thread by the PipelineManager. The manager maintains a registry of PipelineSpec objects (name, pipeline, thread) for lifecycle management.
Event Loop: Each pipeline runs a blocking event loop that consumes events from Kafka, applies transforms, invokes the action, and acknowledges processed events back to Kafka (offset commit).
Shutdown: On SIGINT (Ctrl-C), the signal handler calls PipelineManager.stop_all(), which stops each pipeline (closing sources and actions) and joins each thread.
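
The configuration-loading rules in the first stage can be sketched as follows. This is a minimal stand-in, not the datahub-actions implementation: load_configs, expand_env, and validate are hypothetical helpers, the configs are passed as already-parsed dictionaries rather than YAML files so that only the standard library is needed, and the ${VAR} placeholder syntax and the set of required keys are assumptions made for illustration.

```python
import logging
import os
from string import Template

logger = logging.getLogger(__name__)


def expand_env(value, env):
    """Recursively expand ${VAR} placeholders in strings (hypothetical helper)."""
    if isinstance(value, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        return Template(value).safe_substitute(env)
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    return value


def validate(config):
    """Stand-in validation; the required keys here are an assumption."""
    missing = [key for key in ("name", "source", "action") if key not in config]
    if missing:
        raise ValueError(f"missing required keys: {missing}")


def load_configs(raw_configs, env=None):
    """Return the configs that should be started, applying the skip rules."""
    env = dict(os.environ) if env is None else env
    runnable = []
    for raw in raw_configs:
        config = expand_env(raw, env)
        if not config.get("enabled", True):
            logger.info("Skipping disabled pipeline %s", config.get("name"))
            continue
        try:
            validate(config)
        except ValueError as exc:
            if len(raw_configs) == 1:
                raise  # a single invalid config is fatal
            logger.error("Skipping invalid config: %s", exc)
            continue
        runnable.append(config)
    return runnable
```

With three inputs (one valid, one with enabled: false, one missing its source and action), only the valid config is returned; passing the invalid config on its own raises instead.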

Kafka Topic Consumption

The default Kafka event source subscribes to three topics:

MetadataChangeLog_Versioned_v1: Versioned metadata change log events
MetadataChangeLog_Timeseries_v1: Timeseries metadata change log events
PlatformEvent_v1: Platform-level events (including EntityChangeEvent)

Execution Guarantees

At-least-once delivery: Offsets are committed to Kafka only after an event has been processed. If the action fails or the process stops before the commit, the event may be redelivered on restart.
Configurable retries: The retry_count option controls how many times a single event is retried before being sent to the dead letter queue (failed events log file).
Failure modes: THROW stops the pipeline on unrecoverable failure. CONTINUE logs the failure and moves to the next event.
Failed events logging: Failed events are always written to a log file (default: /tmp/logs/datahub/actions/<pipeline_name>/failed_events.log), regardless of failure mode.
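
The interaction between retries, failure modes, and the failed-events log can be sketched as below. This is an illustrative model rather than the datahub-actions code: process_events and FailureMode are hypothetical names, events are plain dictionaries, and two behaviours are assumptions: that retry_count means retries in addition to the first attempt, and that in CONTINUE mode a failed event is still acknowledged after it has been logged.

```python
import json
import tempfile
from enum import Enum
from pathlib import Path


class FailureMode(Enum):
    THROW = "THROW"
    CONTINUE = "CONTINUE"


def process_events(events, action, retry_count, failure_mode, failed_log):
    """Run `action` on each event, retrying and logging failures (illustrative)."""
    acked = []
    for event in events:
        succeeded = False
        # Assumption: one initial attempt plus up to `retry_count` retries.
        for _attempt in range(retry_count + 1):
            try:
                action(event)
                succeeded = True
                break
            except Exception:
                continue
        if not succeeded:
            # Failed events are recorded regardless of the failure mode.
            with failed_log.open("a") as f:
                f.write(json.dumps(event) + "\n")
            if failure_mode is FailureMode.THROW:
                # Stop the pipeline; the event is never acknowledged.
                raise RuntimeError(f"event failed after {retry_count} retries: {event}")
        # Acknowledge (commit) only after the event has been handled.
        acked.append(event)
    return acked
```

Because the acknowledgement happens last, a crash between the action and the acknowledgement leaves the event uncommitted, which is what makes redelivery on restart possible.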

Usage

Use this principle when deploying metadata automation in production or development environments. Common deployment patterns include:

Single pipeline: datahub-actions -c pipeline.yml for focused automation
Multiple pipelines: datahub-actions -c notify.yml -c propagate.yml for running several automations in one process
Container deployment: Run datahub-actions as a long-lived container process alongside the DataHub stack
Monitoring: Enable Prometheus metrics with --enable-monitoring for production observability

Theoretical Basis

Consumer group pattern: Each pipeline uses its name as a Kafka consumer group ID. This has two important consequences:

Independent consumption: Multiple differently-named pipelines independently consume all events from the same Kafka topics. Each pipeline gets its own view of the event stream.
Shared consumption: Multiple instances of the same-named pipeline share consumption (partitioned). This enables horizontal scaling of a single automation across multiple processes.
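
The two consequences above can be illustrated with a small simulation of partition assignment. It is a simplified model, not Kafka's actual assignor (real consumers use configurable strategies such as range or cooperative-sticky assignment); assign_partitions and the pipeline and instance names are hypothetical.

```python
from collections import defaultdict


def assign_partitions(consumers, num_partitions):
    """Round-robin partitions over (group_id, instance) consumers, per group."""
    groups = defaultdict(list)
    for group_id, instance in consumers:
        groups[group_id].append(instance)

    assignment = {}
    for group_id, instances in groups.items():
        for instance in instances:
            assignment[(group_id, instance)] = []
        # Every group sees every partition; members of one group split them.
        for partition in range(num_partitions):
            owner = instances[partition % len(instances)]
            assignment[(group_id, owner)].append(partition)
    return assignment


if __name__ == "__main__":
    consumers = [("notify", "proc-1"), ("propagate", "proc-1"), ("propagate", "proc-2")]
    for key, partitions in assign_partitions(consumers, 4).items():
        print(key, partitions)
```

In this model the single "notify" instance owns all four partitions, while the two "propagate" instances own two partitions each, so each differently-named pipeline still sees the full stream.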

Thread-per-pipeline model: The PipelineManager runs each pipeline in its own thread, which isolates failures: a failing pipeline does not affect the other running pipelines. The main thread does no event processing; it sleeps in an infinite loop so that it stays available to receive signals and trigger a graceful shutdown.
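
A minimal sketch of the thread-per-pipeline model, using only the standard library. ToyPipeline and ToyPipelineManager are hypothetical stand-ins written for this illustration; they mirror the names described above (a registry of name, pipeline, and thread, plus stop_all) but are not the datahub-actions classes.

```python
import threading
import time
from dataclasses import dataclass


class ToyPipeline:
    """Stand-in for a pipeline: loops until stop() is called."""

    def __init__(self, name):
        self.name = name
        self.processed = 0
        self._stop_requested = threading.Event()

    def run(self):
        while not self._stop_requested.is_set():
            self.processed += 1  # placeholder for consume, transform, act, ack
            time.sleep(0.01)

    def stop(self):
        self._stop_requested.set()


@dataclass
class PipelineSpec:
    name: str
    pipeline: ToyPipeline
    thread: threading.Thread


class ToyPipelineManager:
    def __init__(self):
        self.registry = {}

    def start_pipeline(self, name, pipeline):
        # Daemon threads do not keep the process alive once the main thread exits.
        thread = threading.Thread(target=pipeline.run, name=name, daemon=True)
        self.registry[name] = PipelineSpec(name, pipeline, thread)
        thread.start()

    def stop_all(self):
        # Stop each pipeline, then wait for its thread to finish.
        for spec in list(self.registry.values()):
            spec.pipeline.stop()
            spec.thread.join()
        self.registry.clear()
```

In a real entry point, the main thread would register a SIGINT handler (for example with Python's signal.signal) that calls stop_all(), then sleep in a loop; Python delivers signal handlers on the main thread, which is why that thread is kept free of pipeline work.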
