Principle: Datahub project Datahub Recipe Configuration
| Field | Value |
|---|---|
| Principle Name | Recipe Configuration |
| Overview | A declarative configuration pattern for defining metadata ingestion pipelines using YAML recipe files. |
| Status | Active |
| Domains | Data_Integration, Metadata_Management |
| Related Implementations | Datahub_project_Datahub_PipelineConfig |
| Last Updated | 2026-02-10 |
| Knowledge Sources | DataHub Repository |
Description
Recipe configuration separates the concerns of what metadata to extract (source), how to transform it (transformers), and where to send it (sink). Each component is dynamically loaded by type name, enabling a plugin architecture. The recipe is a YAML (or TOML) file that specifies the full pipeline topology without requiring any imperative code.
A recipe file contains these top-level sections:
- source -- Defines the metadata source connector type and its configuration. The `type` field maps to a registered source plugin (e.g., `snowflake`, `bigquery`, `mysql`).
- sink -- (Optional) Defines where metadata is sent. Defaults to `datahub-rest` if not specified. Alternative sinks include `file`, `console`, and `datahub-kafka`.
- transformers -- (Optional) A list of transformers that modify metadata records between extraction and loading. Each transformer has a `type` and optional `config`.
- datahub_api -- (Optional) Connection configuration for the DataHub GMS server, including server URL and authentication token.
- run_id -- (Optional) A unique identifier for this ingestion run. Auto-generated if not specified.
- pipeline_name -- (Optional) A human-readable name for the pipeline (see the skeletal recipe below for how these sections fit together).
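For orientation, the sketch below shows a skeletal recipe with the optional sections filled in; all values are illustrative placeholders, not a validated configuration.

```yaml
source:
  type: mysql
  config:
    host_port: "localhost:3306"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

datahub_api:
  server: "http://localhost:8080"
  token: "${DATAHUB_TOKEN}"

run_id: "mysql-nightly-001"        # optional; auto-generated when omitted
pipeline_name: "mysql_nightly_ingestion"
```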
Usage
Use recipe configuration when defining a new metadata ingestion pipeline from any supported data source to DataHub. Recipes are the standard way to configure and execute ingestion, whether running locally, in CI/CD, or via scheduled orchestration.
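Recipes are typically executed with the CLI (`datahub ingest -c recipe.yaml`), but the same structure can also be driven from Python. A minimal programmatic sketch, assuming the `acryl-datahub` package and its `mysql` plugin are installed; connection values are placeholders:

```python
import os

from datahub.ingestion.run.pipeline import Pipeline

# The dict mirrors the YAML recipe layout: source -> (transformers) -> sink.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",  # placeholder connection details
                "username": "datahub_user",
                "password": os.environ["MYSQL_PASSWORD"],
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # surface any errors the run reported
```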
Example Recipe
```yaml
source:
  type: snowflake
  config:
    account_id: "myaccount"
    username: "datahub_user"
    password: "${SNOWFLAKE_PASSWORD}"
    warehouse: "COMPUTE_WH"
    role: "DATAHUB_ROLE"
    platform_instance: "production"

transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:production"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    token: "${DATAHUB_TOKEN}"
```
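Assuming this recipe is saved as `recipe.yaml`, it can be executed with `datahub ingest -c recipe.yaml`; the `${SNOWFLAKE_PASSWORD}` and `${DATAHUB_TOKEN}` references are resolved from the environment when the file is loaded, so the secrets never need to appear in the file itself.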
Theoretical Basis
Recipe configuration follows the declarative configuration pattern: instead of writing imperative code, users specify the desired pipeline topology in YAML. The system resolves type names to concrete classes via a plugin registry (using Python's entry_points mechanism).
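As a sketch of how a registry like this can resolve a type name to a class, the snippet below uses `importlib.metadata` directly; the group name is hypothetical, and this is not DataHub's actual registry code.

```python
from importlib.metadata import entry_points

def resolve_plugin(group: str, type_name: str):
    """Map a recipe 'type' string to the class registered under that name."""
    for ep in entry_points(group=group):  # keyword form needs Python 3.10+
        if ep.name == type_name:
            return ep.load()  # imports the module and returns the class
    raise ValueError(f"no plugin named {type_name!r} in group {group!r}")

# Hypothetical group name, for illustration only:
# SnowflakeSource = resolve_plugin("datahub.ingestion.source.plugins", "snowflake")
```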
This approach provides several benefits:
- Separation of concerns -- Configuration is separate from execution logic
- Composability -- Sources, transformers, and sinks can be mixed and matched independently
- Environment variable substitution -- Secrets and environment-specific values are resolved at load time via `${ENV_VAR}` syntax
- Validation -- The configuration is validated against pydantic models before execution, providing early error detection (see the sketch after this list)
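A minimal sketch of the last two benefits in combination; `RestSinkConfig` is an illustrative stand-in, not DataHub's actual config model:

```python
import os
import string
from typing import Optional

from pydantic import BaseModel, ValidationError

class RestSinkConfig(BaseModel):
    """Illustrative sink config model (stand-in, not DataHub's class)."""
    server: str
    token: Optional[str] = None

# 1. ${VAR} references are resolved against the environment at load time.
#    DATAHUB_GMS_URL is a hypothetical variable name.
server = string.Template("${DATAHUB_GMS_URL}").safe_substitute(os.environ)

# 2. Validation happens before execution, so mistakes fail fast: here a
#    misspelled key leaves the required 'server' field missing.
try:
    RestSinkConfig(**{"serverr": server})
except ValidationError as err:
    print(err)  # precise, early error instead of a mid-run failure
```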
Constraints
- The `source.type` must match a registered source plugin name
- The `sink.type` must match a registered sink plugin name
- Transformer types must match registered transformer plugin names
- Configuration values can reference environment variables using `${VAR_NAME}` syntax
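The set of plugin names available in a given installation can be inspected with `datahub check plugins` (available in recent DataHub CLI versions).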
Related Pages
- Implemented by: Datahub_project_Datahub_PipelineConfig