Principle:Datahub project Datahub Actions Deployment

From Leeroopedia


Metadata

Field              Value
Principle ID       P-DHACT-005
Title              Actions Deployment
Category           Event-Driven Automation
Status             Active
Last Updated       2026-02-10
Repository         Datahub_project_Datahub
Knowledge Sources  GitHub - datahub-project/datahub, DataHub Documentation
Domains            Event_Processing, Automation, Metadata_Management

Overview

Actions deployment is the process of deploying and running event-driven action pipelines as long-running consumers of DataHub metadata events. It launches one or more pipeline configurations as daemon threads that consume from Kafka, with lifecycle management, retry logic, and offset tracking.

Description

Actions deployment involves launching the datahub-actions CLI with one or more YAML configuration files. The deployment process follows these stages:

Pipeline Lifecycle

  1. Configuration Loading: Each YAML config file is loaded with environment variable expansion via load_config_file(). Disabled pipelines (enabled: false) are skipped. Invalid configs are logged and skipped if multiple configs are provided, or cause an error if only one config is specified.
  2. Pipeline Creation: Pipeline.create(config_dict) validates the configuration, instantiates the event source (Kafka consumer), creates the filter and transform chain, and creates the action plugin.
  3. Thread Management: Each pipeline is started in its own daemon thread by the PipelineManager. The manager maintains a registry of PipelineSpec objects (name, pipeline, thread) for lifecycle management.
  4. Event Loop: Each pipeline runs a blocking event loop that consumes events from Kafka, applies transforms, invokes the action, and acknowledges processed events back to Kafka (offset commit).
  5. Shutdown: On SIGINT (Ctrl-C), the signal handler calls PipelineManager.stop_all(), which stops each pipeline (closing sources and actions) and joins each thread.
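The configuration-loading rules in stage 1 can be sketched in plain Python. This is an illustration under stated assumptions, not the datahub-actions implementation: select_pipeline_configs, expand_env, validate, and ConfigError are hypothetical names, configs are passed as already-parsed dictionaries in place of YAML files, environment expansion is approximated with string.Template, and the "needs a name and an action" check stands in for the real validation.

```python
import logging
from string import Template
from typing import Any, Dict, List, Mapping

logger = logging.getLogger(__name__)


class ConfigError(Exception):
    """Hypothetical error type for an unusable pipeline config."""


def expand_env(value: Any, env: Mapping[str, str]) -> Any:
    """Recursively substitute ${VAR} references inside string values."""
    if isinstance(value, str):
        return Template(value).substitute(env)  # KeyError if VAR is unset
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value


def validate(config: Dict[str, Any]) -> None:
    """Assumed minimal validation: a pipeline needs a name and an action."""
    for key in ("name", "action"):
        if key not in config:
            raise ConfigError(f"missing required key: {key}")


def select_pipeline_configs(
    raw_configs: List[Dict[str, Any]], env: Mapping[str, str]
) -> List[Dict[str, Any]]:
    """Return the configs that should be turned into running pipelines."""
    selected = []
    for raw in raw_configs:
        try:
            config = expand_env(raw, env)
            # Disabled pipelines are skipped before any further validation.
            if config.get("enabled", True) in (False, "false"):
                logger.info("skipping disabled pipeline %s", config.get("name"))
                continue
            validate(config)
        except (ConfigError, KeyError, ValueError) as exc:
            if len(raw_configs) == 1:
                # A single bad config is fatal: there is nothing else to run.
                raise ConfigError(f"invalid config: {exc}") from exc
            logger.warning("skipping invalid config: %s", exc)
            continue
        selected.append(config)
    return selected
```

With several configs, one broken file only reduces the set of running pipelines; with a single config, the same problem surfaces as an error so that the process does not start with nothing to run.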

Kafka Topic Consumption

The default Kafka event source subscribes to three topics:

  • MetadataChangeLog_Versioned_v1: Versioned metadata change log events
  • MetadataChangeLog_Timeseries_v1: Timeseries metadata change log events
  • PlatformEvent_v1: Platform-level events (including EntityChangeEvent)

Execution Guarantees

  • At-least-once delivery: An event's offset is committed to Kafka only after the event has been processed. If processing fails before the offset is committed, the event may be redelivered on restart.
  • Configurable retries: The retry_count option controls how many times a single event is retried before being sent to the dead letter queue (failed events log file).
  • Failure modes: THROW stops the pipeline on unrecoverable failure. CONTINUE logs the failure and moves to the next event.
  • Failed events logging: Failed events are always written to a log file (default: /tmp/logs/datahub/actions/<pipeline_name>/failed_events.log), regardless of failure mode.
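The interaction between retries, failure modes, and the failed events log can be sketched as follows. This is a minimal illustration, not the library's code: process_event, FailureMode, and PipelineStopped are hypothetical names, events are assumed to be JSON-serializable dictionaries, and retry_count is assumed to mean "retries after the first attempt".

```python
import json
from enum import Enum
from pathlib import Path
from typing import Any, Callable, Dict, Optional, Union


class FailureMode(Enum):
    THROW = "THROW"
    CONTINUE = "CONTINUE"


class PipelineStopped(Exception):
    """Hypothetical signal that the pipeline must stop (THROW mode)."""


def process_event(
    event: Dict[str, Any],
    action: Callable[[Dict[str, Any]], None],
    retry_count: int,
    failure_mode: FailureMode,
    failed_events_path: Union[str, Path],
) -> bool:
    """Run the action on one event; return True if it eventually succeeded."""
    last_error: Optional[Exception] = None
    # Assumption: one initial attempt plus retry_count retries.
    for _ in range(retry_count + 1):
        try:
            action(event)
            return True
        except Exception as exc:  # any action failure triggers a retry
            last_error = exc

    # Retries exhausted: record the event regardless of failure mode.
    path = Path(failed_events_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")

    if failure_mode is FailureMode.THROW:
        raise PipelineStopped(f"event failed after retries: {last_error}")
    return False  # CONTINUE: the caller moves on to the next event
```

In this sketch the caller would commit the offset after a True result, or after a False result in CONTINUE mode, and would not commit when PipelineStopped is raised, which is what makes redelivery on restart possible.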

Usage

Use this principle when deploying metadata automation in production or development environments. Common deployment patterns include:

  • Single pipeline: datahub-actions -c pipeline.yml for focused automation
  • Multiple pipelines: datahub-actions -c notify.yml -c propagate.yml for running several automations in one process
  • Container deployment: Run datahub-actions as a long-lived container process alongside the DataHub stack
  • Monitoring: Enable Prometheus metrics with --enable-monitoring for production observability

Theoretical Basis

Consumer group pattern: Each pipeline uses its name as a Kafka consumer group ID. This has two important consequences:

  1. Independent consumption: Multiple differently-named pipelines independently consume all events from the same Kafka topics. Each pipeline gets its own view of the event stream.
  2. Shared consumption: Multiple instances of the same-named pipeline belong to the same consumer group, so Kafka divides the topic partitions among them. This enables horizontal scaling of a single automation across multiple processes.
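The two consequences above can be made concrete with a small simulation that needs no Kafka broker. It is only a sketch: assign_partitions is a hypothetical helper, and the round-robin split is a stand-in for whichever partition assignment strategy the Kafka client actually uses.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# An instance is identified by (consumer group ID, instance ID).
Instance = Tuple[str, str]


def assign_partitions(
    partitions: List[int], instances: List[Instance]
) -> Dict[Instance, List[int]]:
    """Split partitions among the members of each group, group by group."""
    members_by_group: Dict[str, List[str]] = defaultdict(list)
    for group_id, instance_id in instances:
        members_by_group[group_id].append(instance_id)

    assignment: Dict[Instance, List[int]] = {}
    for group_id, members in members_by_group.items():
        members = sorted(members)
        for member in members:
            assignment[(group_id, member)] = []
        # Every group sees every partition; members of a group share them.
        for index, partition in enumerate(partitions):
            owner = members[index % len(members)]
            assignment[(group_id, owner)].append(partition)
    return assignment


if __name__ == "__main__":
    result = assign_partitions(
        partitions=[0, 1, 2, 3],
        instances=[("notify", "a"), ("propagate", "a"), ("propagate", "b")],
    )
    for instance, owned in sorted(result.items()):
        print(instance, owned)
```

Here the single "notify" instance owns all four partitions, while the two "propagate" instances own two partitions each, so each pipeline name still covers the full event stream.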

Thread-per-pipeline model: The PipelineManager runs each pipeline in its own thread, isolating pipeline failures. A failing pipeline does not affect other running pipelines. The main thread sleeps in an infinite loop, serving only as a signal handler anchor for graceful shutdown.
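The thread-per-pipeline model can be sketched with the standard library alone. The classes below are hypothetical stand-ins (SketchPipeline, SketchPipelineSpec, SketchPipelineManager) that mirror the roles described above; the real pipeline would block on a Kafka consumer where this sketch merely counts loop iterations, and exiting the process from the signal handler is an assumption.

```python
import signal
import sys
import threading
from dataclasses import dataclass
from typing import Dict


class SketchPipeline:
    """Stand-in pipeline: loops until asked to stop."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.processed = 0
        self._stop_requested = threading.Event()

    def run(self) -> None:
        while not self._stop_requested.is_set():
            self.processed += 1  # placeholder for consume/transform/act/ack
            self._stop_requested.wait(0.01)

    def stop(self) -> None:
        self._stop_requested.set()


@dataclass
class SketchPipelineSpec:
    name: str
    pipeline: SketchPipeline
    thread: threading.Thread


class SketchPipelineManager:
    def __init__(self) -> None:
        self.registry: Dict[str, SketchPipelineSpec] = {}

    def start_pipeline(self, name: str, pipeline: SketchPipeline) -> None:
        if name in self.registry:
            raise ValueError(f"pipeline {name} is already running")
        # Daemon threads never keep the process alive on their own.
        thread = threading.Thread(target=pipeline.run, name=name, daemon=True)
        self.registry[name] = SketchPipelineSpec(name, pipeline, thread)
        thread.start()

    def stop_all(self) -> None:
        for spec in list(self.registry.values()):
            spec.pipeline.stop()
            spec.thread.join()
        self.registry.clear()


def install_sigint_handler(manager: SketchPipelineManager) -> None:
    """Stop every pipeline on Ctrl-C, then exit (the exit is an assumption)."""

    def handle_sigint(signum, frame) -> None:
        manager.stop_all()
        sys.exit(0)

    signal.signal(signal.SIGINT, handle_sigint)
```

A main function would call install_sigint_handler, start each pipeline, and then sleep in a loop; because Python delivers signals to the main thread, keeping that thread idle is what lets the handler run promptly.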

Related

Implementation:Datahub_project_Datahub_Actions_CLI_Run
