Environment: DataHub Python 3.10 Ingestion Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Metadata_Ingestion, Python
Last Updated: 2026-02-10 00:00 GMT

Overview

Python 3.10+ environment with pydantic v2, click CLI framework, and optional source-specific extras for running DataHub metadata ingestion and actions.

Description

This environment provides the Python runtime required by the metadata-ingestion and datahub-actions packages. Python 3.10 is the enforced minimum; 3.10 and 3.11 are actively tested, while 3.12+ triggers a runtime warning and is not officially supported. The core stack includes pydantic v2 for configuration validation, click for CLI interactions, PyYAML for recipe parsing, and aiohttp for async HTTP operations. Source-specific connectors (Snowflake, BigQuery, Kafka, etc.) are installed via pip extras.

Usage

Use this environment for any CLI Metadata Ingestion, Python SDK Metadata Emission, or Actions Framework workflow. It is the mandatory prerequisite for running the DataHub CLI (`datahub ingest`), programmatic emitters (`DataHubRestEmitter`), and the actions daemon (`datahub-actions`).
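CLI ingestion is driven by a YAML recipe passed to `datahub ingest -c <recipe>`. The fragment below is illustrative only, assuming a local JSON file of metadata change events as the source; the file path and the environment-variable interpolation values are placeholders:

```yaml
# recipe.yaml — run with: datahub ingest -c recipe.yaml
source:
  type: file
  config:
    path: ./mces.json        # placeholder: local metadata events file
sink:
  type: datahub-rest
  config:
    server: "${DATAHUB_GMS_URL}"
    token: "${DATAHUB_GMS_TOKEN}"
```

Recipes support `${VAR}` environment-variable expansion, which keeps tokens out of version control.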

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, Windows (WSL2) | Linux recommended for production |
| Python | 3.10 or 3.11 | 3.12+ triggers a warning; 3.10 minimum enforced in setup.py |
| Disk | 2 GB+ | Varies with extras installed (full `[all]` can be large) |
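The tested interpreter range can be verified up front. A minimal sketch, where `supported` is a hypothetical helper mirroring the CLI's own startup check, not DataHub's actual code:

```python
import sys

def supported(version_info=sys.version_info):
    """True when the interpreter falls in DataHub's tested range (3.10-3.11)."""
    return (3, 10) <= version_info[:2] <= (3, 11)

if not supported():
    print("Warning: use Python 3.10 or 3.11 for DataHub ingestion.")
```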

Dependencies

System Packages

  • `python3-dev` (Linux) or Xcode CLI tools (macOS)
  • `python3-venv` (Linux, for virtual environment creation)
  • `openldap-dev` (only if using LDAP source connector)

Python Packages (Core)

  • `pydantic` >= 2.4.0, < 3.0.0
  • `pydantic_core` != 2.41.3 (excludes buggy release)
  • `click` >= 7.1.2, != 8.2.0, < 9.0.0
  • `PyYAML` < 7.0.0
  • `aiohttp` < 4
  • `avro` >= 1.11.3, < 1.13
  • `requests` (via aiohttp/urllib3)
  • `python-dateutil` >= 2.8.0, < 3.0.0
  • `setuptools` < 82.0.0

Python Packages (Key Extras)

  • [kafka]: `confluent-kafka` >= 2.10.1, < 2.13.0; `fastavro` >= 1.2.0
  • [snowflake]: `snowflake-connector-python` >= 3.4.0; `pandas` < 3.0.0
  • [bigquery]: `google-cloud-bigquery` < 4.0.0; `google-cloud-datacatalog` >= 1.5.0
  • [databricks]: `databricks-sdk` >= 0.30.0; `pyspark` ~= 3.5.6
  • [s3]: `pyspark` ~= 3.5.6 (use `[s3-slim]` to avoid PySpark)
  • [iceberg]: `pyiceberg` >= 0.9.0, <= 0.10.0; `pydantic` < 2.12

Credentials

The following environment variables configure connection and authentication (not all are required; the deprecated variables are fallbacks):

  • `DATAHUB_GMS_URL`: Complete GMS server URL (e.g., `http://localhost:8080`)
  • `DATAHUB_GMS_TOKEN`: Authentication token for GMS API access
  • `DATAHUB_GMS_HOST`: GMS host (deprecated fallback for DATAHUB_GMS_URL)
  • `DATAHUB_GMS_PORT`: GMS port number (deprecated fallback)
  • `DATAHUB_GMS_PROTOCOL`: Protocol for GMS connection, `http` or `https` (default: `http`)
  • `DATAHUB_USERNAME`: Username for generating access tokens
  • `DATAHUB_PASSWORD`: Password for generating access tokens
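The deprecated host/port/protocol variables only matter when the full URL is absent. A minimal sketch of that precedence, where `resolve_gms_url` is a hypothetical helper illustrating the documented fallback order, not the CLI's actual resolver:

```python
import os

def resolve_gms_url(env=os.environ):
    """Prefer DATAHUB_GMS_URL; otherwise assemble it from the deprecated parts."""
    url = env.get("DATAHUB_GMS_URL")
    if url:
        return url
    protocol = env.get("DATAHUB_GMS_PROTOCOL", "http")
    host = env.get("DATAHUB_GMS_HOST", "localhost")
    port = env.get("DATAHUB_GMS_PORT", "8080")
    return f"{protocol}://{host}:{port}"
```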

Quick Install

# Install core CLI
pip install 'acryl-datahub'

# Install with specific source extras
pip install 'acryl-datahub[snowflake,bigquery,kafka]'

# Install actions framework
pip install 'acryl-datahub-actions'

# Install actions with Kafka source
pip install 'acryl-datahub-actions[kafka]'

Code Evidence

Python version warning from `entrypoints.py:55-60`:

if sys.version_info >= (3, 12):
    click.secho(
        "Python versions above 3.11 are not actively tested with yet. Please use Python 3.11 for now.",
        fg="red",
        err=True,
    )

Python minimum version from `setup.py:1147`:

python_requires=">=3.10"

Pydantic core exclusion from `setup.py:23`:

# https://github.com/pydantic/pydantic-core/issues/1841
"pydantic_core!=2.41.3,<3.0.0",

Click version exclusion from `setup.py:39`:

"click>=7.1.2,!=8.2.0,<9.0.0",

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `Python versions above 3.11 are not actively tested` | Python 3.12+ detected at CLI startup | Downgrade to Python 3.10 or 3.11 |
| `ImportError: No module named 'datahub'` | Package not installed in active venv | Run `pip install acryl-datahub` |
| `confluent_kafka` build failure | Missing librdkafka system library | Install `librdkafka-dev` (apt) or `librdkafka` (brew) |
| `pydantic.errors.PydanticUserError` | Pydantic v1 API used with v2 | Ensure pydantic >= 2.4.0 is installed |
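When the pydantic error appears, confirm which major version is actually importable in the active environment. A sketch using only the standard library (`pydantic_major` is a hypothetical helper):

```python
from importlib import metadata

def pydantic_major(default=None):
    """Return pydantic's installed major version, or `default` if not installed."""
    try:
        return int(metadata.version("pydantic").split(".")[0])
    except metadata.PackageNotFoundError:
        return default

# A result of 1 (or `default`) explains PydanticUserError under DataHub's v2-only code.
```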

Compatibility Notes

  • Nix/Immutable filesystems: Set `DATAHUB_VENV_USE_COPIES=true` if venv creation fails due to symlink restrictions.
  • Windows: Not officially supported; use WSL2 for development.
  • PyIceberg extras: Requires `pydantic < 2.12` which may conflict with other extras.
  • PySpark extras: The `[s3]` extra includes PySpark 3.5.6; use `[s3-slim]` for a lightweight alternative without PySpark.
  • numpy constraint: Several extras (feast, cassandra) require `numpy < 2` due to binary incompatibility.
