
Implementation:Datahub project Datahub Datahub Ingest Dry Run

From Leeroopedia


Property Value
Page Type Implementation (API Doc)
Workflow Metadata_Ingestion_Pipeline
API CLI datahub ingest run -c recipe.yml --dry-run, implemented by the run() Click command
Source File metadata-ingestion/src/datahub/cli/ingest_cli.py
Repository https://github.com/datahub-project/datahub
Implements Principle:Datahub_project_Datahub_Connection_Validation
Last Updated 2026-02-09 17:00 GMT

Overview

Description

The datahub ingest run command is the primary CLI entry point for executing metadata ingestion pipelines. It supports several validation-oriented flags that allow users to test connectivity and verify configuration without committing metadata to the sink:

  • --dry-run: Runs the full pipeline (source extraction, transformation, work unit generation) but skips all sink writes. This exercises the entire data flow without persisting any metadata.
  • --test-source-connection: Bypasses the pipeline entirely and invokes the source's dedicated connection test via ConnectionManager.test_source_connection(). Returns a structured connection report.
  • --preview: Limits the number of work units extracted from the source to a configurable count (default 10 via --preview-workunits), enabling rapid feedback on the metadata being produced.
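The preview limit can be pictured as a cap on the source's work-unit stream. A minimal, self-contained sketch of that idea (itertools.islice stands in for the pipeline's internal limiting, which this page does not specify):

```python
from itertools import islice
from typing import Iterable, List

def preview_workunits(source: Iterable[str], limit: int = 10) -> List[str]:
    """Conceptual sketch of preview mode: draw at most `limit` work units
    from the source, mirroring --preview with --preview-workunits."""
    return list(islice(source, limit))

# Stand-in source that would otherwise yield 100 work units.
fake_source = (f"workunit-{i}" for i in range(100))
print(len(preview_workunits(fake_source, limit=5)))  # 5
```

Because the limit is applied at extraction time, a preview against a large source returns quickly regardless of how much metadata the source could produce.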

The command is implemented as a Click command function decorated with telemetry tracking and upgrade checking. It loads the recipe file, resolves environment variables, and delegates to Pipeline.create() and Pipeline.run() for execution.
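The environment-variable resolution step can be illustrated with a small stand-in. This regex-based expander is an assumption about the observable behaviour (${VAR} placeholders replaced from the process environment, as in Example 1 below), not the CLI's actual config-loader implementation:

```python
import os
import re

def expand_env_vars(recipe_text: str) -> str:
    """Replace ${VAR} placeholders in recipe text from the environment.

    Illustrative sketch only: the real CLI resolves variables inside its
    config loader before constructing the pipeline.
    """
    def _sub(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name!r} is not set")
        return os.environ[name]

    return re.sub(r"\$\{(\w+)\}", _sub, recipe_text)

os.environ["MYSQL_USER"] = "datahub"
print(expand_env_vars('username: "${MYSQL_USER}"'))  # username: "datahub"
```

Failing fast on an unset variable (rather than substituting an empty string) is useful with --dry-run, since a missing credential then surfaces as a clear error instead of a confusing connection failure.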

Usage

# Full dry run - exercises entire pipeline without writing to sink
datahub ingest run -c recipe.yml --dry-run

# Preview mode - extract only 5 work units, skip sink writes
datahub ingest run -c recipe.yml --dry-run --preview --preview-workunits 5

# Test source connection only - fastest validation
datahub ingest run -c recipe.yml --test-source-connection

# Test connection and write report to file
datahub ingest run -c recipe.yml --test-source-connection --report-to connection_report.json

# Strict warnings mode - treat warnings as errors (non-zero exit code)
datahub ingest run -c recipe.yml --strict-warnings

Code Reference

Source Location

File Lines Description
metadata-ingestion/src/datahub/cli/ingest_cli.py L45-258 run() click command definition with all options and execution logic
metadata-ingestion/src/datahub/cli/ingest_cli.py L481-495 _test_source_connection() helper function
metadata-ingestion/src/datahub/ingestion/run/connection.py — ConnectionManager.test_source_connection() implementation

Signature

@ingest.command()
@click.option("-c", "--config", type=click.Path(dir_okay=False), required=True,
              help="Config file in .toml or .yaml format.")
@click.option("-n", "--dry-run", type=bool, is_flag=True, default=False,
              help="Perform a dry run of the ingestion, essentially skipping writing to sink.")
@click.option("--preview", type=bool, is_flag=True, default=False,
              help="Perform limited ingestion from the source to the sink to get a quick preview.")
@click.option("--preview-workunits", type=int, default=10,
              help="The number of workunits to produce for preview.")
@click.option("--strict-warnings/--no-strict-warnings", default=False,
              help="If enabled, ingestion runs with warnings will yield a non-zero error code")
@click.option("--test-source-connection", type=bool, is_flag=True, default=False,
              help="When set, ingestion will only test the source connection details from the recipe")
@click.option("--report-to", type=str, default="datahub",
              help="Provide a destination to send a structured report from the run.")
@click.option("--no-default-report", type=bool, is_flag=True, default=False,
              help="Turn off default reporting of ingestion results to DataHub")
@click.option("--no-spinner", type=bool, is_flag=True, default=False,
              help="Turn off spinner")
@click.option("--no-progress", type=bool, is_flag=True, default=False,
              help="If enabled, mute intermediate progress ingestion reports")
def run(
    config: str,
    dry_run: bool,
    preview: bool,
    strict_warnings: bool,
    preview_workunits: int,
    test_source_connection: bool,
    report_to: Optional[str],
    no_default_report: bool,
    no_spinner: bool,
    no_progress: bool,
    record: bool,
    record_password: Optional[str],
    record_output_path: Optional[str],
    no_s3_upload: bool,
    no_secret_redaction: bool,
) -> None:

Import

from datahub.cli.ingest_cli import ingest

I/O Contract

Direction Type Description
Input -c/--config (str, required) Path to YAML or TOML recipe file
Input --dry-run (bool, flag) When set, skips all sink writes
Input --preview (bool, flag) When set, limits work unit extraction to --preview-workunits count
Input --preview-workunits (int, default 10) Number of work units to extract in preview mode
Input --test-source-connection (bool, flag) When set, tests only the source connection and exits
Input --strict-warnings (bool, flag) When set, treat warnings as failures (non-zero exit code)
Input --report-to (str, default "datahub") Destination for structured report; "datahub" sends to server, other values are treated as file paths
Output Exit code (int) 0 on success, 1 on failure or warnings (if strict mode)
Output Terminal output Color-coded pipeline summary with record counts, failures, and warnings
Output Connection report (JSON) When --test-source-connection is used with --report-to file.json

Usage Examples

Example 1: Dry run to validate a new MySQL recipe

# recipe.yml
# source:
#   type: mysql
#   config:
#     host_port: "db.example.com:3306"
#     database: "analytics"
#     username: "${MYSQL_USER}"
#     password: "${MYSQL_PASS}"

export MYSQL_USER=datahub
export MYSQL_PASS=secret

datahub ingest run -c recipe.yml --dry-run
# Output: Pipeline finished successfully; produced 0 events (sink writes skipped)

Example 2: Preview mode to sample metadata from Snowflake

datahub ingest run -c snowflake_recipe.yml --dry-run --preview --preview-workunits 5
# Extracts only 5 work units, prints source report showing sampled entities

Example 3: Test source connection and save report

datahub ingest run -c recipe.yml --test-source-connection --report-to /tmp/conn_report.json
# Exits immediately after testing connection
# Report written to /tmp/conn_report.json
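A script can consume that report file to gate further steps. The field names below (internal_failure, basic_connectivity) are assumptions modelled on DataHub's connection-test report and may differ by version and source type; a stand-in report is written first so the snippet is self-contained:

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the file a real --test-source-connection run would write.
# Field names here are assumptions, not a documented schema.
report_path = Path(tempfile.gettempdir()) / "conn_report.json"
report_path.write_text(json.dumps({
    "internal_failure": False,
    "basic_connectivity": {"capable": True, "failure_reason": None},
}))

report = json.loads(report_path.read_text())
if report.get("internal_failure") or not report["basic_connectivity"]["capable"]:
    raise SystemExit("source connection test failed")
print("connection OK")
```

Inspect an actual report from your source type before relying on specific keys, since capability details vary per connector.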

Example 4: Programmatic dry run via Python

from datahub.ingestion.run.pipeline import Pipeline

recipe = {
    "source": {
        "type": "mysql",
        "config": {
            "host_port": "localhost:3306",
            "database": "test_db",
            "username": "root",
            "password": "root",
        },
    },
}

pipeline = Pipeline.create(recipe, dry_run=True, preview_mode=True, preview_workunits=5)
pipeline.run()
ret = pipeline.pretty_print_summary()  # returns an int exit code (0 on success)

Related Pages

Principle:Datahub_project_Datahub_Connection_Validation