# Environment: Datahub project Datahub Python Ingestion
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Data_Engineering |
| Last Updated | 2026-02-09 17:00 GMT |
## Overview
Python 3.10+ runtime environment required for the DataHub CLI, metadata ingestion framework, and all Python-based source connectors.
## Description
This environment provides the Python runtime for the `acryl-datahub` package, which includes the `datahub` CLI tool and the full metadata ingestion framework. The package uses setuptools with extensive `extras_require` declarations for 60+ source connectors. Key base dependencies include Pydantic 2.x, Click, PyYAML, and the Avro serialization library. The environment requires Python 3.10 or higher, with Python 3.11 recommended for development.
## Usage
Use this environment for metadata ingestion via the CLI or Python library, developing new source connectors, and running smoke tests. It is the mandatory prerequisite for the Pip_Install_Acryl_Datahub, Pipeline_Create_And_Run, and PipelineConfig_From_Recipe implementations.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 24.04 preferred), macOS, Windows (limited) | Docker images use Ubuntu 24.04 base |
| Python | >= 3.10 (3.11 recommended) | Smoke tests require exactly Python 3.11.x |
| pip | Latest recommended | Needed for wheel building of native extensions |
| RAM | 2+ GB for basic ingestion | Some connectors (BigQuery, Snowflake) may require more |
## Dependencies
### System Packages
- `python` 3.10+ (or `python` 3.11 for development)
- `pip` (latest)
- `setuptools` < 82.0.0 (82.0.0 deprecated `pkg_resources`)
- `wheel`
- `gcc` / `build-essential` (for compiling native extensions like `confluent-kafka`)
- `librdkafka-dev` (for the Kafka connector)
- `openjdk-17-jre-headless` (in the Docker ingestion image)
### Python Packages (Base)
- `typing_extensions` >= 4.8.0, < 5.0.0
- `pydantic` >= 2.4.0, < 3.0.0
- `click` >= 7.1.2, != 8.2.0, < 9.0.0
- `PyYAML` < 7.0.0
- `docker` < 8.0.0
- `avro` >= 1.11.3, < 1.13
- `sentry-sdk` >= 1.33.1, < 3.0.0
- `sqlalchemy` >= 1.4.39, < 2 (for SQL-based connectors)
## Credentials
Credentials vary by connector. Common environment variables:
- `DATAHUB_GMS_URL`: URL of the GMS server (default: `http://localhost:8080`)
- `DATAHUB_GMS_TOKEN`: Authentication token for the GMS API
- `DATAHUB_TELEMETRY_ENABLED`: Set to `false` to disable anonymous usage telemetry
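The defaulting behavior for `DATAHUB_GMS_URL` can be sketched with a small stdlib helper (`resolve_gms_url` is a hypothetical name used here for illustration, not a DataHub API):

```python
import os

# Fall back to the documented default endpoint when DATAHUB_GMS_URL is unset.
def resolve_gms_url(env=None):
    env = os.environ if env is None else env
    return env.get("DATAHUB_GMS_URL", "http://localhost:8080")
```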
Connector-specific credentials are configured in recipe YAML files, not environment variables.
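As an illustration, a minimal MySQL-to-REST recipe might look like the following sketch; the host, username, and server URL are placeholder values:

```yaml
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub_reader
    password: "${MYSQL_PASSWORD}"  # recipes can template values from env vars
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```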
## Quick Install

```shell
# Install the base CLI
pip install acryl-datahub

# Install with specific connectors
pip install 'acryl-datahub[mysql,snowflake,bigquery]'

# Install all connectors (large install)
pip install 'acryl-datahub[all]'

# For development
cd metadata-ingestion
../gradlew :metadata-ingestion:installDev
```
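After installing, the package's presence and resolved version can be checked without invoking the CLI. A stdlib sketch (the `installed_version` helper name is illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

# Return the installed version string, or None if the distribution is absent.
def installed_version(dist_name):
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

print(installed_version("acryl-datahub"))
```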
## Code Evidence

Python version requirement from `metadata-ingestion/setup.py:1147`:

```python
python_requires=">=3.10",
```

Click version avoidance from `metadata-ingestion/setup.py:38-39`:

```python
# Avoiding click 8.2.0 due to https://github.com/pallets/click/issues/2894
"click>=7.1.2,!=8.2.0,<9.0.0",
```

Setuptools version constraint from `metadata-ingestion/setup.py:33-34`:

```python
# setuptools 82.0.0 deprecated pkg_resource
"setuptools<82.0.0",
```

Pydantic version requirement from `metadata-ingestion/setup.py:21`:

```python
"pydantic>=2.4.0,<3.0.0",
```
## Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `Failed building wheel for avro-python3` | Outdated pip/setuptools/wheel | `pip install --upgrade pip wheel setuptools && pip cache purge` |
| `error: command 'x86_64-linux-gnu-gcc' failed` | Missing C compiler or librdkafka headers | `apt install build-essential librdkafka-dev` or `pip install confluent_kafka==1.5.0` |
| `datahub: command not found` | PATH not configured for Python scripts | Use `python3 -m datahub` instead, or add the Python scripts directory to PATH |
| Pydantic 1.x/2.x conflict | Transitive dependency pulling the wrong Pydantic version | Use virtual environments; do not install all connectors into one environment |
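The virtual-environment advice in the last row can be scripted from Python itself. A sketch using the stdlib `venv` module; the paths here are throwaway examples, and in practice you would keep one stable environment per connector set:

```python
import os
import tempfile
import venv

# Create an isolated environment so connector extras cannot pull
# conflicting Pydantic versions into a shared site-packages.
env_dir = os.path.join(tempfile.mkdtemp(), "datahub-env")
venv.EnvBuilder(with_pip=False).create(env_dir)

# The environment's interpreter lives under bin/ (POSIX) or Scripts/ (Windows).
bin_dir = "Scripts" if os.name == "nt" else "bin"
print(os.path.isdir(os.path.join(env_dir, bin_dir)))
```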
## Compatibility Notes

- Linux ARM (aarch64): Some connectors are not available: DB2 (`ibm_db`), SAP HANA (`hdbcli`), and Trino (`trino[sqlalchemy]` has limited ARM support).
- Windows: Not officially supported for development. Use WSL2.
- macOS (Apple Silicon): Supported, but some native extensions may require Rosetta or Homebrew-installed dependencies.
- Airflow Compatibility: The `typing_extensions` version is constrained by Airflow compatibility requirements.
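The Linux ARM caveats above can be detected programmatically before choosing connectors. A stdlib sketch (the `is_linux_arm` function name is illustrative):

```python
import platform

# Linux ARM hosts may lack prebuilt wheels for ibm_db, hdbcli, and similar
# native connector dependencies.
def is_linux_arm():
    return platform.system() == "Linux" and platform.machine() in ("aarch64", "arm64")

if is_linux_arm():
    print("warning: DB2 and SAP HANA connectors are unavailable on Linux ARM")
```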
## Related Pages

- Implementation: Datahub_project_Datahub_Pip_Install_Acryl_Datahub
- Implementation: Datahub_project_Datahub_Pipeline_Create_And_Run
- Implementation: Datahub_project_Datahub_PipelineConfig_From_Recipe
- Implementation: Datahub_project_Datahub_Datahub_Ingest_Dry_Run
- Implementation: Datahub_project_Datahub_DatahubRestSink_Write_Record