Principle:Datahub project Datahub CLI Installation

Property	Value
Page Type	Principle
Workflow	Metadata_Ingestion_Pipeline
Concept	Installing and configuring the DataHub CLI tool and its connector plugins
Repository	https://github.com/datahub-project/datahub
Implemented By	Implementation:Datahub_project_Datahub_Pip_Install_Acryl_Datahub
Last Updated	2026-02-09 17:00 GMT

Overview

Description

The CLI Installation principle addresses the basic requirement of making the DataHub metadata ingestion framework available as a command-line tool on a user's system. DataHub is distributed as the acryl-datahub Python package, which installs the datahub command (a console script generated at install time rather than a compiled binary). This CLI is the primary interface for running metadata ingestion pipelines, managing recipes, and interacting with a DataHub instance from the terminal.

A defining characteristic of the DataHub CLI is its plugin architecture based on pip extras. Rather than shipping a single monolithic package with every possible data source connector bundled in, the framework uses Python's extras_require mechanism to allow users to install only the connectors they need. Each connector (e.g., MySQL, Snowflake, BigQuery, Kafka) declares its own set of pip dependencies, and users opt in to specific connectors at install time by specifying extras in square brackets.
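As an illustration of how extras compose into a single install target, the sketch below assembles a pip requirement string from a list of connector names. The helper build_requirement is hypothetical and not part of DataHub; only the package name acryl-datahub and the square-bracket syntax come from the description above.

```python
def build_requirement(package, extras=(), version=None):
    """Return a pip requirement such as 'acryl-datahub[mysql,snowflake]'.

    Hypothetical helper for illustration; not part of the DataHub codebase.
    """
    # Deduplicate and sort so the same selection always yields the same string.
    names = sorted({e.strip().lower() for e in extras if e.strip()})
    requirement = package
    if names:
        requirement += "[" + ",".join(names) + "]"
    if version:
        # Optional exact pin, appended after the extras list.
        requirement += "==" + version
    return requirement


if __name__ == "__main__":
    target = build_requirement("acryl-datahub", ["mysql", "snowflake"])
    # Quote the requirement so the shell does not interpret the brackets.
    print("pip install '" + target + "'")
```

Installing with no extras yields only the base package; each name added to the list pulls in that connector's dependencies alongside it.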

This design yields several advantages: it keeps the base installation lightweight, avoids dependency conflicts between connectors that may require incompatible library versions, and allows the ecosystem to grow without burdening every user with every dependency.

Usage

The CLI Installation principle applies whenever a user or CI/CD system needs to:

Set up a fresh environment for running DataHub ingestion pipelines
Add support for a new data source connector to an existing installation
Upgrade the DataHub CLI to a newer version while preserving connector selections
Deploy the ingestion framework in containerized or cloud-based environments where minimal footprint is desirable

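The principle also governs how the datahub command itself comes to exist. The console-script entry point datahub = datahub.entrypoints:main is registered via setuptools, so installation generates a datahub executable in the environment's script directory. The command is therefore available on the PATH whenever that environment (system Python, virtual environment, or container image) is the active one.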
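The entry point declaration can be inspected with the standard library alone. The following is a minimal sketch, assuming Python 3.9 or later; the helper names script_target and locate_command are hypothetical, while the declaration string is the one quoted above.

```python
import shutil
from importlib.metadata import EntryPoint

# The console-script declaration quoted in the text, expressed as an EntryPoint.
DATAHUB_SCRIPT = EntryPoint(
    name="datahub",
    value="datahub.entrypoints:main",
    group="console_scripts",
)


def script_target(entry_point):
    """Split an entry point into (module, attribute). Requires Python 3.9+."""
    return entry_point.module, entry_point.attr


def locate_command(name):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    print(script_target(DATAHUB_SCRIPT))
    print(locate_command("datahub") or "datahub not found on PATH")
```

If locate_command returns None after a successful install, the usual cause is that the environment holding the package is not the one currently active in the shell.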

Theoretical Basis

The CLI Installation principle is grounded in a common package-management pattern for extensible platforms. The pattern is widely used in the Python ecosystem wherever a core framework must support a variable number of optional integrations. The key ideas are:

Separation of core from optional dependencies. The base package (acryl-datahub) includes the ingestion framework, configuration loading, pipeline orchestration, and the REST/Kafka sink infrastructure. Connector-specific dependencies (database drivers, cloud SDKs, API clients) are declared as extras and only installed when explicitly requested.

Plugin registration via entry points. Each source connector is registered as a setuptools entry point under the datahub.ingestion.source.plugins group, with sinks and transformers registered under analogous groups of their own. At runtime, the framework consults the entry point registry to discover which sources, sinks, and transformers are available. This decouples the framework from any specific connector implementation and allows third-party plugins to be installed and discovered automatically.
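Discovery through the entry point registry can be sketched with importlib.metadata. This is not DataHub's own loader, only an illustration of the mechanism; the function list_plugins is hypothetical, and the group name is the one given above. The branch on select is there because the return type of entry_points() differs between Python versions.

```python
from importlib.metadata import entry_points


def list_plugins(group):
    """Return the sorted names of all entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the result supports selection by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: the result is a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


if __name__ == "__main__":
    # Prints an empty list when no package registers sources in this environment.
    print(list_plugins("datahub.ingestion.source.plugins"))
```

Because the registry is populated from installed package metadata, a third-party distribution that declares an entry point in the same group would appear in this list without any change to the framework.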

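Reproducible environments. Specifying extras at install time (e.g., pip install 'acryl-datahub[mysql,snowflake]') makes the connector selection explicit and repeatable. Extras alone do not fix package versions, so a fully deterministic set of installed packages also requires pinning the acryl-datahub version and, ideally, its transitive dependencies. This matters most for CI/CD pipelines and production deployments, where reproducibility is a first-class concern.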
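One way to check that two environments really resolved to the same packages is to record what is installed and compare the results. The sketch below does this with the standard library, in the spirit of pip freeze; freeze_environment is a hypothetical helper, not a DataHub or pip API.

```python
from importlib.metadata import distributions


def freeze_environment():
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip broken installs that record no name
            pins.add(f"{name}=={dist.version}")
    # Case-insensitive sort so the output is stable across platforms.
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for line in freeze_environment():
        print(line)
```

Running this in a CI job and diffing the output against a stored snapshot would surface any drift in connector dependencies between builds.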

Dependency isolation. Because connectors declare their own dependency ranges, the extras mechanism provides a natural boundary for managing version conflicts. A user who only needs MySQL does not inherit Snowflake's dependency tree, reducing the surface area for version resolution failures.
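The boundary between connectors is visible in package metadata, where each optional requirement carries an extra marker. The sketch below groups requirement lines by that marker. The SAMPLE entries are invented package names for illustration and are not DataHub's actual dependency lists; for an installed package, lines of the same shape can be read with importlib.metadata.requires.

```python
import re
from collections import defaultdict

# Matches markers such as: extra == "mysql"  or  extra == 'mysql'
_EXTRA_MARKER = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def group_by_extra(requirement_lines):
    """Map each extra name to the requirement specs it pulls in.

    Requirements without an extra marker are filed under '<base>'.
    """
    groups = defaultdict(list)
    for line in requirement_lines:
        spec, _, marker = line.partition(";")
        extras = _EXTRA_MARKER.findall(marker)
        for extra in extras or ["<base>"]:
            groups[extra].append(spec.strip())
    return dict(groups)


# Hypothetical metadata lines; the package names are made up.
SAMPLE = [
    "core-lib>=1.0",
    'mysql-driver>=2.0; extra == "mysql"',
    'warehouse-sdk>=3.0; extra == "snowflake"',
    'shared-sql-toolkit>=1.4; extra == "mysql"',
    'shared-sql-toolkit>=1.4; extra == "snowflake"',
]

if __name__ == "__main__":
    for extra, specs in group_by_extra(SAMPLE).items():
        print(extra, specs)
```

In this sample, selecting only the mysql extra never brings warehouse-sdk into the resolver's input, which is the isolation property described above.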
