Principle:Datahub project Datahub CLI Installation
| Property | Value |
|---|---|
| Page Type | Principle |
| Workflow | Metadata_Ingestion_Pipeline |
| Concept | Installing and configuring the DataHub CLI tool and its connector plugins |
| Repository | https://github.com/datahub-project/datahub |
| Implemented By | Implementation:Datahub_project_Datahub_Pip_Install_Acryl_Datahub |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
The CLI Installation principle covers how the DataHub metadata ingestion framework is made available as a command-line tool on a user's system. DataHub is distributed as the acryl-datahub Python package, which installs the datahub command. This CLI is the primary interface for running metadata ingestion pipelines, managing recipes, and interacting with a DataHub instance from the terminal.
A defining characteristic of the DataHub CLI is its plugin architecture based on pip extras. Rather than shipping a single monolithic package with every data source connector bundled in, the framework uses the setuptools extras_require mechanism so that users install only the connectors they need. Each connector (e.g., MySQL, Snowflake, BigQuery, Kafka) declares its own set of pip dependencies, and users opt in to specific connectors at install time by listing extras in square brackets after the package name, for example acryl-datahub[mysql,snowflake].
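The extras syntax described above can be sketched in a few lines of Python. This is a minimal illustration, not part of DataHub itself: the helper names `build_requirement` and `pip_install_command` are hypothetical, and the command is only constructed, never executed.

```python
import sys

BASE_PACKAGE = "acryl-datahub"


def build_requirement(extras=(), package=BASE_PACKAGE):
    """Return a pip requirement string such as 'acryl-datahub[mysql,snowflake]'."""
    # De-duplicate while keeping the caller's order; ignore blank entries.
    unique = list(dict.fromkeys(e.strip() for e in extras if e.strip()))
    return f"{package}[{','.join(unique)}]" if unique else package


def pip_install_command(extras=()):
    """Build (but do not run) the pip invocation for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", build_requirement(extras)]


if __name__ == "__main__":
    # Show the command that would install the MySQL and Snowflake connectors.
    print(" ".join(pip_install_command(["mysql", "snowflake"])))
```

Invoking pip through `sys.executable -m pip` targets the interpreter that is currently running, which helps when several Python environments coexist on one machine.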
This design has several advantages: it keeps the base installation lightweight, reduces the risk of dependency conflicts between connectors that require incompatible library versions, and lets the ecosystem grow without burdening every user with every dependency.
Usage
The CLI Installation principle applies whenever a user or CI/CD system needs to:
- Set up a fresh environment for running DataHub ingestion pipelines
- Add support for a new data source connector to an existing installation
- Upgrade the DataHub CLI to a newer version while preserving connector selections
- Deploy the ingestion framework in containerized or cloud-based environments where minimal footprint is desirable
The principle also governs how the entry point datahub = datahub.entrypoints:main is registered via setuptools, so that after installation the datahub command is available on the PATH of the environment it was installed into (for example, an activated virtual environment).
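As a rough illustration of what that entry point declaration encodes, the sketch below splits the specification into its parts and checks whether the resulting command can be found on PATH. The function names are hypothetical and only the standard library is used; this is not how setuptools itself processes the declaration.

```python
import shutil


def parse_entry_point(spec):
    """Split 'name = package.module:attr' into (name, module, attr)."""
    name, _, target = spec.partition("=")
    module, _, attr = target.strip().partition(":")
    return name.strip(), module.strip(), attr.strip()


def command_on_path(command="datahub"):
    """Return the resolved path of the command, or None if it is not on PATH."""
    return shutil.which(command)


if __name__ == "__main__":
    print(parse_entry_point("datahub = datahub.entrypoints:main"))
    print(command_on_path() or "datahub is not on PATH in this environment")
```

The left-hand side becomes the command name, and the right-hand side names the module and the callable that the generated launcher script invokes.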
Theoretical Basis
The CLI Installation principle is grounded in a package-management pattern for extensible platforms. The pattern is widely used in the Python ecosystem where a core framework must support a variable number of optional integrations. The key ideas are:
Separation of core from optional dependencies. The base package (acryl-datahub) includes the ingestion framework, configuration loading, pipeline orchestration, and the REST/Kafka sink infrastructure. Connector-specific dependencies (database drivers, cloud SDKs, API clients) are declared as extras and only installed when explicitly requested.
Plugin registration via entry points. Each source connector is registered as a setuptools entry point under the datahub.ingestion.source.plugins group, and sinks and transformers are registered under entry point groups of their own. At runtime, the framework uses the entry point registry to discover available sources, sinks, and transformers. This decouples the framework from any specific connector implementation and allows third-party plugins to be installed and discovered automatically.
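Entry point discovery of this kind can be approximated with the standard library alone. The sketch below is not DataHub's own registry code; it assumes only the group name given above, and `list_plugins` is a hypothetical helper. In an environment without acryl-datahub installed it simply returns an empty list.

```python
from importlib import metadata

# Group name taken from the paragraph above.
SOURCE_GROUP = "datahub.ingestion.source.plugins"


def list_plugins(group=SOURCE_GROUP):
    """Return the sorted names of entry points registered under `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


if __name__ == "__main__":
    for name in list_plugins():
        print(name)
```

Because the lookup reads installed package metadata rather than importing connector modules, listing plugins does not require the connectors' heavy dependencies to load.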
Reproducible environments. By specifying extras at install time (e.g., pip install 'acryl-datahub[mysql,snowflake]'), users state explicitly which connectors an environment contains. Combined with a pinned package version, this yields a repeatable set of installed packages, which is critical for CI/CD pipelines and production deployments where reproducibility is a first-class concern.
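To reproduce an environment exactly, the resolved versions can also be recorded once installation has finished. The following sketch, similar in spirit to pip freeze, lists every installed distribution as a name==version line. The helper name `snapshot_environment` is hypothetical, and the output depends entirely on the environment in which it runs.

```python
from importlib import metadata


def snapshot_environment():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name:
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    print("\n".join(snapshot_environment()))
```

Storing such a snapshot next to the pipeline definition gives CI/CD jobs a concrete list to install from, instead of re-resolving dependencies on every run.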
Dependency isolation. Because connectors declare their own dependency ranges, the extras mechanism provides a natural boundary for managing version conflicts. A user who only needs MySQL does not inherit Snowflake's dependency tree, reducing the surface area for version resolution failures.
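The isolation described here shows up in package metadata, where each optional requirement carries a marker naming the extra it belongs to. The sketch below filters such lines for a chosen set of extras. The sample data is invented for illustration (these are not real acryl-datahub dependencies), `requirements_for` is a hypothetical helper, and the matching is a simplification that looks only at extra markers rather than evaluating full environment markers.

```python
import re

# Invented Requires-Dist style lines; names and versions are illustrative only.
SAMPLE_REQUIRES = [
    "example-core-lib>=1.0",
    'example-mysql-driver>=1.0; extra == "mysql"',
    'example-snowflake-sdk>=2.0; extra == "snowflake"',
]

_EXTRA_MARKER = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def requirements_for(extras, requires):
    """Return the requirements needed for the base package plus `extras`."""
    wanted = set(extras)
    selected = []
    for line in requires:
        requirement, _, marker = line.partition(";")
        marker_extras = set(_EXTRA_MARKER.findall(marker))
        # No extra marker means a core dependency; otherwise an extra must match.
        if not marker_extras or marker_extras & wanted:
            selected.append(requirement.strip())
    return selected


if __name__ == "__main__":
    print(requirements_for(["mysql"], SAMPLE_REQUIRES))
```

With the package installed, `importlib.metadata.requires("acryl-datahub")` returns the real requirement lines (or None if none are declared), which could be passed in place of the sample list.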