Principle:Datahub project Datahub Python SDK Installation
Metadata
| Field | Value |
|---|---|
| principle_name | Python SDK Installation |
| description | The practice of installing the DataHub Python SDK with the appropriate transport extras for programmatic metadata emission. |
| type | principle |
| status | active |
| last_updated | 2026-02-10 |
| version | 1.0 |
Overview
Python SDK Installation is the practice of installing the DataHub Python SDK (acryl-datahub) with the appropriate transport extras for programmatic metadata emission. The extras system (datahub-rest, datahub-kafka) controls which transport backend is available at runtime, enabling the SDK to emit metadata via HTTP to GMS or via Kafka topics.
Description
Python SDK installation provisions the emitter classes needed for programmatic metadata emission. The DataHub Python SDK is distributed as a single package (acryl-datahub) with an extras system that determines which transport dependencies are installed.
The package supports two primary transport backends:
- datahub-rest -- Installs the REST-based emitter, which communicates with DataHub GMS over HTTP. This extra pulls in the requests library and related HTTP dependencies through the rest_common dependency set.
- datahub-kafka -- Installs the Kafka-based emitter, which publishes metadata change events directly to Kafka topics. This extra pulls in confluent_kafka[schemaregistry,avro]>=1.9.0 and fastavro>=1.2.0.
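Because the installed extras decide which transport is usable at runtime, an application may want to check what is importable before it builds an emitter. The following is a minimal sketch using only the standard library. The helper name `available_transports` and the probe mapping are illustrative assumptions, not part of the SDK, and an importable module only suggests that the extra was installed (the same module could have arrived through another package).

```python
from importlib.util import find_spec

# Assumed mapping from extra name to one top-level module that the extra
# is expected to make importable. Illustrative only; not read from the
# package metadata.
TRANSPORT_PROBES = {
    "datahub-rest": "requests",
    "datahub-kafka": "confluent_kafka",
}


def available_transports(probes=TRANSPORT_PROBES):
    """Return, sorted, the extras whose probe module is importable here."""
    found = []
    for extra, module_name in probes.items():
        # find_spec returns None when a top-level module is not installed.
        if find_spec(module_name) is not None:
            found.append(extra)
    return sorted(found)


if __name__ == "__main__":
    print("Importable transports:", available_transports())
```

A caller could use the result to fail early with a clear message instead of hitting an ImportError deep inside emission code.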
Additional extras exist for specific use cases:
- sync-file-emitter -- For file-based emission with file locking support.
- datahub-lite -- A lightweight local metadata store using DuckDB.
The base package includes core dependencies such as pydantic>=2.4.0, avro>=1.11.3, and typing_extensions>=4.8.0, which are required regardless of the chosen transport backend.
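The relationship between the base package and its extras can be modelled as a small lookup table. The sketch below copies only the dependency specifiers named in this section. The real package declares more dependencies than are shown, the bare requests entry stands in for the whole rest_common set, and `resolve_dependencies` is a hypothetical helper written for illustration.

```python
# Core dependencies named in this section; always installed.
BASE_DEPS = ["pydantic>=2.4.0", "avro>=1.11.3", "typing_extensions>=4.8.0"]

# Transport extras and the additional dependencies named in this section.
# "requests" is a simplification of the rest_common dependency set.
EXTRAS = {
    "datahub-rest": ["requests"],
    "datahub-kafka": [
        "confluent_kafka[schemaregistry,avro]>=1.9.0",
        "fastavro>=1.2.0",
    ],
}


def resolve_dependencies(extras):
    """Return base dependencies plus those added by each requested extra."""
    deps = list(BASE_DEPS)
    for extra in extras:
        # An unknown extra raises KeyError, which surfaces typos early.
        for spec in EXTRAS[extra]:
            if spec not in deps:
                deps.append(spec)
    return deps
```

Calling `resolve_dependencies([])` returns only the base set, which mirrors the point that the core dependencies are required whichever transport is chosen.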
Usage
Use Python SDK Installation when building Python applications that need to emit metadata to DataHub programmatically (not via CLI). The choice of extra depends on the deployment scenario:
- Choose datahub-rest when:
  - Direct HTTP communication with GMS is preferred
  - Synchronous emission with request-response confirmation is needed
  - The application environment has network access to the GMS endpoint
  - Simplicity of setup is a priority
- Choose datahub-kafka when:
  - Kafka infrastructure is already in place
  - Asynchronous, high-throughput emission is required
  - Decoupling the emitter from the GMS backend is desired
  - The application should continue operating even if GMS is temporarily unavailable
Both extras can be installed simultaneously to support multiple emission strategies within the same application.
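The selection criteria above can be written down as a small decision helper that produces the requirement string to install. This is a rough sketch: the function names, the flag names, and the exact decision rules are assumptions made for illustration, and a real project would likely weigh these factors less mechanically.

```python
PACKAGE = "acryl-datahub"


def choose_extras(*, gms_reachable, kafka_in_place,
                  needs_confirmation=False,
                  needs_high_throughput=False,
                  tolerate_gms_outage=False):
    """Pick transport extras from the deployment traits (illustrative rules)."""
    extras = set()
    # Kafka suits asynchronous, high-throughput, or GMS-decoupled emission.
    if kafka_in_place and (needs_high_throughput or tolerate_gms_outage):
        extras.add("datahub-kafka")
    # REST is the simple default and gives request-response confirmation.
    if gms_reachable and (needs_confirmation or not extras):
        extras.add("datahub-rest")
    return sorted(extras)


def requirement(extras):
    """Build a pip requirement string such as 'acryl-datahub[datahub-rest]'."""
    if not extras:
        return PACKAGE
    return f"{PACKAGE}[{','.join(sorted(extras))}]"
```

The string returned by `requirement` can be passed to `pip install` (quoted, so the shell does not interpret the brackets) or written to a requirements file.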
Theoretical Basis
This principle follows a transport-abstracted SDK installation pattern. The same core package supports multiple emission backends selected at install time. This design decouples the metadata construction API from the transport layer, allowing developers to:
- Write metadata emission code once using the common Emitter Protocol interface
- Select the transport backend at deployment time via pip extras
- Switch between REST and Kafka transports without changing the code that builds and emits metadata; only the emitter that is instantiated differs
The extras mechanism leverages Python's packaging ecosystem (setuptools extras_require) to manage optional dependency trees, ensuring that only the required transport dependencies are installed in each environment.
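The transport-abstracted pattern can be illustrated without the SDK installed. In the sketch below, `MetadataEmitter`, `FakeRestEmitter`, and `FakeKafkaEmitter` are hypothetical stand-ins written for this example. They are not the SDK's classes and their signatures are assumptions. The point is only that `publish_all` depends on the shared interface, so either transport can be passed in.

```python
from typing import Any, Iterable, List, Protocol


class MetadataEmitter(Protocol):
    """Hypothetical stand-in for a common emitter interface."""

    def emit(self, item: Any) -> None: ...

    def flush(self) -> None: ...


class FakeRestEmitter:
    """Stand-in for a synchronous HTTP transport: each emit is sent at once."""

    def __init__(self, server: str) -> None:
        self.server = server
        self.sent: List[Any] = []

    def emit(self, item: Any) -> None:
        self.sent.append(item)

    def flush(self) -> None:
        pass  # nothing is buffered


class FakeKafkaEmitter:
    """Stand-in for an asynchronous transport: items are buffered until flush."""

    def __init__(self, bootstrap: str) -> None:
        self.bootstrap = bootstrap
        self.pending: List[Any] = []
        self.delivered: List[Any] = []

    def emit(self, item: Any) -> None:
        self.pending.append(item)

    def flush(self) -> None:
        self.delivered.extend(self.pending)
        self.pending.clear()


def publish_all(emitter: MetadataEmitter, items: Iterable[Any]) -> int:
    """Emit every item through whichever transport was supplied."""
    count = 0
    for item in items:
        emitter.emit(item)
        count += 1
    emitter.flush()
    return count
```

Only the line that constructs the emitter needs to know which transport is in use; `publish_all` itself is unchanged when the transport is swapped.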
Related
- Implemented by: Datahub_project_Datahub_Pip_Install_Datahub_SDK
- Related to: Datahub_project_Datahub_Emitter_Initialization