Principle:Datahub project Datahub CLI Package Installation
| Field | Value |
|---|---|
| Principle Name | CLI Package Installation |
| Overview | The practice of installing DataHub's command-line interface tooling to gain access to metadata ingestion, Docker management, and administrative commands. |
| Status | Active |
| Domains | Data_Integration, Metadata_Management |
| Related Implementations | Datahub_project_Datahub_Pip_Install_Acryl_Datahub |
| Last Updated | 2026-02-10 |
| Knowledge Sources | DataHub Repository |
Description
CLI package installation provisions the datahub console script together with all associated plugin entry points (sources, sinks, transformers, reporters). A modular extras system lets users select only the connectors they need, minimizing the dependency footprint. The package is published on PyPI as acryl-datahub and exposes a single top-level console script entry point: datahub = datahub.entrypoints:main.
The package uses Python's setuptools with a plugin registry architecture. Entry points are declared for several plugin categories:
- datahub.ingestion.source.plugins -- Source connectors for extracting metadata (over 60 connectors including Snowflake, BigQuery, MySQL, Kafka, and more)
- datahub.ingestion.transformer.plugins -- Transformers for modifying metadata in-flight (ownership, tags, domains, terms, etc.)
- datahub.ingestion.sink.plugins -- Sinks for writing metadata (datahub-rest, datahub-kafka, file, console)
- datahub.ingestion.reporting_provider.plugins -- Reporting providers for ingestion run summaries
- datahub.ingestion.checkpointing_provider.plugins -- State checkpointing for stateful ingestion
Usage
Install the DataHub CLI when:
- Setting up a new environment for metadata ingestion
- Deploying DataHub locally for development or testing
- Integrating DataHub CLI into CI/CD pipelines for automated metadata synchronization
- Building custom ingestion scripts that leverage the DataHub Python SDK
Theoretical Basis
Package management with extras/optional dependencies follows the Python packaging convention of declaring optional feature groups (PEP 508). This allows a single package to serve multiple use cases without forcing all dependencies on every user. Each connector extra (e.g., snowflake, bigquery, mysql) declares its own dependency set, so users install only what they need. This reduces the risk of dependency conflicts and keeps the base installation lightweight.
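The extras pattern can be sketched in setuptools style as follows; the dependency names below are illustrative examples, not DataHub's real dependency lists:

```python
# Illustrative extras_require declaration (setuptools-style sketch).
# The packages listed are examples only, not DataHub's actual pins.
base_requirements = ["click", "PyYAML", "pydantic", "avro"]

extras_require = {
    "snowflake": ["snowflake-sqlalchemy"],
    "bigquery": ["google-cloud-bigquery"],
    "mysql": ["pymysql"],
}

# A convenience "all" extra is simply the union of every connector's deps
extras_require["all"] = sorted({dep for deps in extras_require.values()
                                for dep in deps})

print(extras_require["all"])
```

With this layout, pip install acryl-datahub pulls only base_requirements, while pip install "acryl-datahub[mysql]" adds that connector's dependency set on top.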
The plugin registry pattern uses Python's entry_points mechanism, which allows third-party packages to register additional sources, sinks, and transformers without modifying the core package.
Constraints
- Requires Python >= 3.10
- Some connector extras have native library dependencies (e.g., confluent-kafka requires librdkafka)
- The base installation includes framework dependencies (click, PyYAML, pydantic, avro, etc.) but no connector-specific libraries
Related Pages
- Implemented by: Datahub_project_Datahub_Pip_Install_Acryl_Datahub
Implementation:Datahub_project_Datahub_Pip_Install_Acryl_Datahub