
Principle:Datahub project Datahub Recipe Configuration



Principle Name: Recipe Configuration
Overview: A declarative configuration pattern for defining metadata ingestion pipelines using YAML recipe files.
Status: Active
Domains: Data_Integration, Metadata_Management
Related Implementations: Datahub_project_Datahub_PipelineConfig
Last Updated: 2026-02-10
Knowledge Sources: DataHub Repository

Description

Recipe configuration separates the concerns of what metadata to extract (source), how to transform it (transformers), and where to send it (sink). Each component is dynamically loaded by type name, enabling a plugin architecture. The recipe is a YAML (or TOML) file that specifies the full pipeline topology without requiring any imperative code.

A recipe file contains these top-level sections:

  • source -- Defines the metadata source connector type and its configuration. The type field maps to a registered source plugin (e.g., snowflake, bigquery, mysql).
  • sink -- (Optional) Defines where metadata is sent. Defaults to datahub-rest if not specified. Alternative sinks include file, console, and datahub-kafka.
  • transformers -- (Optional) A list of transformers that modify metadata records between extraction and loading. Each transformer has a type and optional config.
  • datahub_api -- (Optional) Connection configuration for the DataHub GMS server, including server URL and authentication token.
  • run_id -- (Optional) A unique identifier for this ingestion run. Auto-generated if not specified.
  • pipeline_name -- (Optional) A human-readable name for the pipeline.
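
The sections above compose into a single file. A minimal sketch, assuming illustrative mysql connector options (the exact config fields vary by connector):

```yaml
# Minimal recipe: a source is required; everything else is optional.
pipeline_name: "mysql_to_console"    # optional human-readable name
source:
  type: mysql                        # must match a registered source plugin name
  config:
    host_port: "localhost:3306"      # illustrative connector options
    username: "reader"
    password: "${MYSQL_PASSWORD}"    # resolved from the environment at load time
sink:
  type: console                      # print records instead of sending to GMS
```

With the sink omitted entirely, the pipeline would instead default to datahub-rest.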

Usage

Use recipe configuration when defining a new metadata ingestion pipeline from any supported data source to DataHub. Recipes are the standard way to configure and execute ingestion, whether running locally, in CI/CD, or via scheduled orchestration.
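
Recipes are typically executed with the DataHub CLI; assuming the recipe is saved as recipe.yaml (the filename is illustrative):

```shell
# Install the CLI together with the needed source plugin, then run the recipe.
pip install 'acryl-datahub[snowflake]'
datahub ingest -c recipe.yaml
```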

Example Recipe

source:
  type: snowflake                      # must match a registered source plugin name
  config:
    account_id: "myaccount"
    username: "datahub_user"
    password: "${SNOWFLAKE_PASSWORD}"  # resolved from the environment at load time
    warehouse: "COMPUTE_WH"
    role: "DATAHUB_ROLE"
    platform_instance: "production"

transformers:
  - type: simple_add_dataset_tags      # applied between extraction and loading
    config:
      tag_urns:
        - "urn:li:tag:production"

sink:
  type: datahub-rest                   # push metadata to the DataHub GMS REST API
  config:
    server: "http://localhost:8080"
    token: "${DATAHUB_TOKEN}"

Theoretical Basis

Recipe configuration follows the declarative configuration pattern: instead of writing imperative code, users specify the desired pipeline topology in YAML. The system resolves type names to concrete classes via a plugin registry (using Python's entry_points mechanism).
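
The resolution step can be sketched as follows. This is a simplified stand-in that uses a plain dict as the registry; the real DataHub code discovers plugin classes via Python entry_points, and the class names here are hypothetical:

```python
# Sketch of resolving a recipe's type name to a concrete class via a registry.
from typing import Any, Dict, Type

class Source:
    """Illustrative base class for metadata source connectors."""
    def __init__(self, config: Dict[str, Any]):
        self.config = config

class MySQLSource(Source):
    """Hypothetical connector standing in for a registered plugin."""
    pass

# In DataHub this mapping is populated from entry_points; here it is hardcoded.
SOURCE_REGISTRY: Dict[str, Type[Source]] = {"mysql": MySQLSource}

def build_source(recipe: Dict[str, Any]) -> Source:
    """Look up recipe['source']['type'] in the registry and instantiate it."""
    src = recipe["source"]
    try:
        cls = SOURCE_REGISTRY[src["type"]]
    except KeyError:
        raise ValueError(f"Unknown source type: {src['type']!r}")
    return cls(src.get("config", {}))

recipe = {"source": {"type": "mysql", "config": {"host_port": "localhost:3306"}}}
source = build_source(recipe)
```

Because the recipe only names the plugin, swapping connectors is a one-line YAML change rather than a code change.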

This approach provides several benefits:

  • Separation of concerns -- Configuration is separate from execution logic
  • Composability -- Sources, transformers, and sinks can be mixed and matched independently
  • Environment variable substitution -- Secrets and environment-specific values are resolved at load time via ${ENV_VAR} syntax
  • Validation -- The configuration is validated against pydantic models before execution, providing early error detection
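
The environment variable substitution mentioned above can be illustrated with a minimal resolver; the real loader is richer than this sketch, which only handles the bare ${VAR} form:

```python
# Minimal sketch of ${ENV_VAR} substitution applied when a recipe is loaded.
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def resolve_env_vars(text: str) -> str:
    """Replace each ${VAR} with the value of VAR, failing loudly on unset
    variables so misconfiguration surfaces before ingestion runs."""
    def _sub(match: "re.Match") -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"Environment variable {name} is not set")
        return os.environ[name]
    return _ENV_PATTERN.sub(_sub, text)

os.environ["DATAHUB_TOKEN"] = "secret-token"          # for demonstration only
resolved = resolve_env_vars('token: "${DATAHUB_TOKEN}"')
```

Keeping secrets out of the recipe file means the same YAML can be committed to version control and reused across environments.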

Constraints

  • The source.type must match a registered source plugin name
  • The sink.type must match a registered sink plugin name
  • Transformer types must match registered transformer plugin names
  • Configuration values can reference environment variables using ${VAR_NAME} syntax
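
A pre-flight check enforcing the first two constraints might look like the following sketch. The plugin-name sets are hardcoded from the examples on this page; DataHub itself validates against pydantic models and its full plugin registry:

```python
# Illustrative early validation of a recipe dict before execution.
from typing import Any, Dict

KNOWN_SOURCES = {"snowflake", "bigquery", "mysql"}              # sample names
KNOWN_SINKS = {"datahub-rest", "datahub-kafka", "file", "console"}

def validate_recipe(recipe: Dict[str, Any]) -> None:
    """Raise ValueError on an unknown source or sink type."""
    if "source" not in recipe:
        raise ValueError("recipe must define a source section")
    source_type = recipe["source"].get("type")
    if source_type not in KNOWN_SOURCES:
        raise ValueError(f"unknown source type: {source_type!r}")
    # The sink section is optional and defaults to datahub-rest.
    sink_type = recipe.get("sink", {}).get("type", "datahub-rest")
    if sink_type not in KNOWN_SINKS:
        raise ValueError(f"unknown sink type: {sink_type!r}")

validate_recipe({"source": {"type": "mysql"}})  # passes via the default sink
```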

Related Pages

Implementation:Datahub_project_Datahub_PipelineConfig
