
Environment: Datahub Python Ingestion (Datahub project)

From Leeroopedia


Knowledge Sources

  • Domains: Infrastructure, Data_Engineering
  • Last Updated: 2026-02-09 17:00 GMT

Overview

A Python 3.10+ runtime environment is required for the DataHub CLI, the metadata ingestion framework, and all Python-based source connectors.

Description

This environment provides the Python runtime for the acryl-datahub package, which includes the datahub CLI tool and the full metadata ingestion framework. The package uses setuptools with extensive extras_require declarations for 60+ source connectors. Key base dependencies include Pydantic 2.x, Click, PyYAML, and the Avro serialization library. The environment requires Python 3.10 or higher, with Python 3.11 recommended for development.

Usage

Use this environment for metadata ingestion via the CLI or Python library, developing new source connectors, and running smoke tests. It is the mandatory prerequisite for the Pip_Install_Acryl_Datahub, Pipeline_Create_And_Run, and PipelineConfig_From_Recipe implementations.
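For example, CLI ingestion is driven by a recipe file. The sketch below is illustrative, not canonical: the file source's config field names and the sample paths are assumptions that may vary by acryl-datahub version.

```yaml
# recipe.yml: a minimal, illustrative ingestion recipe
source:
  type: file                  # replays metadata events from a local JSON file
  config:
    path: ./mces.json         # hypothetical input file
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080   # matches the DATAHUB_GMS_URL default
```

A recipe like this is run with datahub ingest -c recipe.yml; the same dictionary form is what the Pipeline_Create_And_Run and PipelineConfig_From_Recipe implementations consume programmatically.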

System Requirements

  • OS: Linux (Ubuntu 24.04 preferred), macOS, Windows (limited). Docker images use an Ubuntu 24.04 base.
  • Python: >= 3.10 (3.11 recommended). Smoke tests require exactly Python 3.11.x.
  • pip: latest recommended; needed to build wheels for native extensions.
  • RAM: 2+ GB for basic ingestion. Some connectors (BigQuery, Snowflake) may require more.

Dependencies

System Packages

  • python3.10+ (or python3.11 for development)
  • pip (latest)
  • setuptools < 82.0.0 (82.0.0 deprecated pkg_resources)
  • wheel
  • gcc / build-essential (for compiling native extensions like confluent-kafka)
  • librdkafka-dev (for Kafka connector)
  • openjdk-17-jre-headless (in Docker ingestion image)
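On Debian/Ubuntu, the list above maps to roughly the following one-off setup. The package names are a sketch assuming Ubuntu 24.04; adjust for your distribution, and omit the JRE outside the Docker ingestion image:

```shell
sudo apt-get update
# Interpreter, packaging tools, and build toolchain for native
# extensions such as confluent-kafka (needs librdkafka headers)
sudo apt-get install -y python3.11 python3.11-venv python3-pip \
    build-essential librdkafka-dev openjdk-17-jre-headless
```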

Python Packages (Base)

  • typing_extensions >= 4.8.0, < 5.0.0
  • pydantic >= 2.4.0, < 3.0.0
  • click >= 7.1.2, != 8.2.0, < 9.0.0
  • PyYAML < 7.0.0
  • docker < 8.0.0
  • avro >= 1.11.3, < 1.13
  • sentry-sdk >= 1.33.1, < 3.0.0
  • sqlalchemy >= 1.4.39, < 2 (for SQL-based connectors)

Credentials

Credentials vary by connector. Common environment variables:

  • DATAHUB_GMS_URL: URL of the GMS server (default: http://localhost:8080)
  • DATAHUB_GMS_TOKEN: Authentication token for GMS API
  • DATAHUB_TELEMETRY_ENABLED: Set to false to disable anonymous usage telemetry

Connector-specific credentials are configured in recipe YAML files, not environment variables.
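As a sketch of that split, the MySQL recipe below keeps connector credentials in the recipe while GMS settings come from the environment. The ${VAR} interpolation and the exact MySQL field names are assumptions here and should be checked against the connector documentation for your version:

```yaml
source:
  type: mysql
  config:
    host_port: localhost:3306
    database: analytics
    username: datahub_reader        # connector credential, lives in the recipe
    password: "${MYSQL_PASSWORD}"   # interpolated so the secret stays out of the file
sink:
  type: datahub-rest
  config:
    server: "${DATAHUB_GMS_URL}"
    token: "${DATAHUB_GMS_TOKEN}"
```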

Quick Install

# Install base CLI
pip install acryl-datahub

# Install with specific connectors
pip install 'acryl-datahub[mysql,snowflake,bigquery]'

# Install all connectors (large install)
pip install 'acryl-datahub[all]'

# For development
cd metadata-ingestion
../gradlew :metadata-ingestion:installDev

Code Evidence

Python version requirement from metadata-ingestion/setup.py:1147:

python_requires=">=3.10",

Click version avoidance from metadata-ingestion/setup.py:38-39:

# Avoiding click 8.2.0 due to https://github.com/pallets/click/issues/2894
"click>=7.1.2,!=8.2.0,<9.0.0",

Setuptools version constraint from metadata-ingestion/setup.py:33-34:

# setuptools 82.0.0 deprecated pkg_resource
"setuptools<82.0.0",

Pydantic version requirement from metadata-ingestion/setup.py:21:

"pydantic>=2.4.0,<3.0.0",

Common Errors

  • Failed building wheel for avro-python3
    Cause: outdated pip/setuptools/wheel. Solution: pip install --upgrade pip wheel setuptools && pip cache purge
  • error: command 'x86_64-linux-gnu-gcc' failed
    Cause: missing C compiler or librdkafka headers. Solution: apt install build-essential librdkafka-dev, or pin pip install confluent_kafka==1.5.0
  • datahub: command not found
    Cause: PATH not configured for Python scripts. Solution: use python3 -m datahub instead, or add the Python scripts directory to PATH
  • Pydantic 1.x/2.x conflict
    Cause: a transitive dependency pulls the wrong Pydantic version. Solution: use separate virtual environments; do not install all connectors into one environment
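For the "datahub: command not found" case, the scripts directory that needs to be on PATH can be located with the standard library (this is generic Python packaging behavior, not DataHub-specific):

```python
# Print the directory where pip places console-script entry points
# such as `datahub`; add it to PATH if the command is not found.
import sysconfig

scripts_dir = sysconfig.get_path("scripts")
print(scripts_dir)
```

For per-user installs (pip install --user), the equivalent directory comes from the "scripts" path of the user scheme instead.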

Compatibility Notes

  • Linux ARM (aarch64): Some connectors are not available: DB2 (ibm_db), SAP HANA (hdbcli), and Trino (trino[sqlalchemy] has limited ARM support).
  • Windows: Not officially supported for development. Use WSL2.
  • macOS (Apple Silicon): Supported but some native extensions may require Rosetta or Homebrew-installed dependencies.
  • Airflow Compatibility: The typing_extensions version is constrained by Airflow compatibility requirements.

Related Pages

  • Pip_Install_Acryl_Datahub
  • Pipeline_Create_And_Run
  • PipelineConfig_From_Recipe