Principle: Datahub project Datahub Recipe Configuration
| Field | Value |
|---|---|
| Principle Name | Recipe Configuration |
| Overview | A declarative configuration pattern for defining metadata ingestion pipelines using YAML recipe files. |
| Status | Active |
| Domains | Data_Integration, Metadata_Management |
| Related Implementations | Datahub_project_Datahub_PipelineConfig |
| Last Updated | 2026-02-10 |
| Knowledge Sources | DataHub Repository |
Description
Recipe configuration separates the concerns of what metadata to extract (source), how to transform it (transformers), and where to send it (sink). Each component is dynamically loaded by type name, enabling a plugin architecture. The recipe is a YAML (or TOML) file that specifies the full pipeline topology without requiring any imperative code.
A recipe file contains these top-level sections:
- source -- Defines the metadata source connector type and its configuration. The `type` field maps to a registered source plugin (e.g., `snowflake`, `bigquery`, `mysql`).
- sink -- (Optional) Defines where metadata is sent. Defaults to `datahub-rest` if not specified. Alternative sinks include `file`, `console`, and `datahub-kafka`.
- transformers -- (Optional) A list of transformers that modify metadata records between extraction and loading. Each transformer has a `type` and optional `config`.
- datahub_api -- (Optional) Connection configuration for the DataHub GMS server, including server URL and authentication token.
- run_id -- (Optional) A unique identifier for this ingestion run. Auto-generated if not specified.
- pipeline_name -- (Optional) A human-readable name for the pipeline (see the skeletal recipe below for how these sections fit together).
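For orientation, the sketch below shows a skeletal recipe with the optional sections filled in; all values are illustrative placeholders, not a validated configuration.

```yaml
source:
  type: mysql
  config:
    host_port: "localhost:3306"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

datahub_api:
  server: "http://localhost:8080"
  token: "${DATAHUB_TOKEN}"

run_id: "mysql-nightly-001"        # optional; auto-generated when omitted
pipeline_name: "mysql_nightly_ingestion"
```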
Usage
Use recipe configuration when defining a new metadata ingestion pipeline from any supported data source to DataHub. Recipes are the standard way to configure and execute ingestion, whether running locally, in CI/CD, or via scheduled orchestration.
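Recipes are typically executed with the CLI (`datahub ingest -c recipe.yaml`), but the same structure can also be driven from Python. A minimal programmatic sketch, assuming the `acryl-datahub` package and its `mysql` plugin are installed; connection values are placeholders:

```python
import os

from datahub.ingestion.run.pipeline import Pipeline

# The dict mirrors the YAML recipe layout: source -> (transformers) -> sink.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",  # placeholder connection details
                "username": "datahub_user",
                "password": os.environ["MYSQL_PASSWORD"],
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # surface any errors the run reported
```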
Example Recipe
```yaml
source:
  type: snowflake
  config:
    account_id: "myaccount"
    username: "datahub_user"
    password: "${SNOWFLAKE_PASSWORD}"
    warehouse: "COMPUTE_WH"
    role: "DATAHUB_ROLE"
    platform_instance: "production"

transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:production"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    token: "${DATAHUB_TOKEN}"
```
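Assuming this recipe is saved as `recipe.yaml`, it can be executed with `datahub ingest -c recipe.yaml`; the `${SNOWFLAKE_PASSWORD}` and `${DATAHUB_TOKEN}` references are resolved from the environment when the file is loaded, so the secrets never need to appear in the file itself.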
Theoretical Basis
Recipe configuration follows the declarative configuration pattern: instead of writing imperative code, users specify the desired pipeline topology in YAML. The system resolves type names to concrete classes via a plugin registry (using Python's entry_points mechanism).
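As a sketch of how a registry like this can resolve a type name to a class, the snippet below uses `importlib.metadata` directly; the group name is hypothetical, and this is not DataHub's actual registry code.

```python
from importlib.metadata import entry_points

def resolve_plugin(group: str, type_name: str):
    """Map a recipe 'type' string to the class registered under that name."""
    for ep in entry_points(group=group):  # keyword form needs Python 3.10+
        if ep.name == type_name:
            return ep.load()  # imports the module and returns the class
    raise ValueError(f"no plugin named {type_name!r} in group {group!r}")

# Hypothetical group name, for illustration only:
# SnowflakeSource = resolve_plugin("datahub.ingestion.source.plugins", "snowflake")
```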
This approach provides several benefits:
- Separation of concerns -- Configuration is separate from execution logic
- Composability -- Sources, transformers, and sinks can be mixed and matched independently
- Environment variable substitution -- Secrets and environment-specific values are resolved at load time via `${ENV_VAR}` syntax
- Validation -- The configuration is validated against pydantic models before execution, providing early error detection (see the sketch after this list)
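A minimal sketch of the last two benefits in combination; `RestSinkConfig` is an illustrative stand-in, not DataHub's actual config model:

```python
import os
import string
from typing import Optional

from pydantic import BaseModel, ValidationError

class RestSinkConfig(BaseModel):
    """Illustrative sink config model (stand-in, not DataHub's class)."""
    server: str
    token: Optional[str] = None

# 1. ${VAR} references are resolved against the environment at load time.
#    DATAHUB_GMS_URL is a hypothetical variable name.
server = string.Template("${DATAHUB_GMS_URL}").safe_substitute(os.environ)

# 2. Validation happens before execution, so mistakes fail fast: here a
#    misspelled key leaves the required 'server' field missing.
try:
    RestSinkConfig(**{"serverr": server})
except ValidationError as err:
    print(err)  # precise, early error instead of a mid-run failure
```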
Constraints
- The `source.type` must match a registered source plugin name
- The `sink.type` must match a registered sink plugin name
- Transformer types must match registered transformer plugin names
- Configuration values can reference environment variables using `${VAR_NAME}` syntax
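The set of plugin names available in a given installation can be inspected with `datahub check plugins` (available in recent DataHub CLI versions).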
Related Pages
- Implemented by: Datahub_project_Datahub_PipelineConfig