
Implementation:Datahub project Datahub Docker CLI Ingest Sample Data

From Leeroopedia


Implementation Name: Docker CLI Ingest Sample Data
Namespace: Datahub_project_Datahub
Workflow: Docker_Quickstart_Deployment
Type: API Doc
Language: Python
Last Updated: 2026-02-10
Source Repository: datahub-project/datahub
Source File: metadata-ingestion/src/datahub/cli/docker_cli.py, lines 864-903
Domains: Deployment, Docker, Metadata_Management

Overview

The ingest_sample_data() function populates a running DataHub instance with demonstration metadata using the built-in demo-data ingestion source and the datahub-rest sink.

Function Signature

@docker.command()
@click.option(
    "--token",
    type=str,
    is_flag=False,
    default=None,
    help="The token to be used when ingesting, used when datahub is deployed with METADATA_SERVICE_AUTH_ENABLED=true",
)
@upgrade.check_upgrade
def ingest_sample_data(token: Optional[str]) -> None:

CLI Usage

datahub docker ingest-sample-data [OPTIONS]

Parameters

  • --token (str, default: None): Authentication token for DataHub instances deployed with METADATA_SERVICE_AUTH_ENABLED=true.

Return Value

Declared as returning None, but the function never returns normally: it terminates via sys.exit(ret), where ret is the return code from pipeline.pretty_print_summary() (0 for success, 1 for failure).

Implementation Details

Step 1: Docker Health Check (lines 877-882)

status = check_docker_quickstart()
if not status.is_ok():
    raise status.to_exception(
        header="Docker is not ready:",
        footer="Try running `datahub docker quickstart` first.",
    )

Verifies all DataHub containers are healthy before attempting ingestion. Raises QuickstartError with actionable guidance if containers are unhealthy.
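The check-then-raise pattern of Step 1 can be sketched with a stand-in status object. This is a hypothetical simplification: the real status class lives in datahub.cli.docker_check and carries a richer interface; only the is_ok()/to_exception() shape is taken from the source above.

```python
# Hypothetical stand-in for the status object returned by
# check_docker_quickstart(); field names here are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuickstartStatus:
    errors: List[str] = field(default_factory=list)

    def is_ok(self) -> bool:
        return not self.errors

    def to_exception(self, header: str, footer: str) -> RuntimeError:
        # Bundle all container problems into one actionable error message.
        return RuntimeError("\n".join([header, *self.errors, footer]))


status = QuickstartStatus(errors=["container 'datahub-gms' is unhealthy"])
if not status.is_ok():
    error = status.to_exception(
        header="Docker is not ready:",
        footer="Try running `datahub docker quickstart` first.",
    )
```

The point of returning an exception rather than raising inside the status object is that the caller decides when (and whether) to abort.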

Step 2: Pipeline Construction (lines 886-898)

recipe: dict = {
    "source": {
        "type": "demo-data",
        "config": {},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

if token is not None:
    recipe["sink"]["config"]["token"] = token

Constructs a minimal ingestion recipe with:

  • Source: demo-data -- The built-in demo data generator, registered in setup.py entry points as demo-data = datahub.ingestion.source.demo_data.DemoDataSource
  • Sink: datahub-rest -- REST emitter targeting http://localhost:8080 (the GMS service)
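The same recipe can also be written as a standalone YAML file and run with the generic `datahub ingest -c recipe.yaml` command. This sketch assumes the default quickstart port and an unauthenticated instance; adjust server and token for your deployment.

```yaml
# recipe.yaml -- equivalent of the dict built in Step 2
source:
  type: demo-data
  config: {}
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    # token: "<your-token>"  # uncomment when METADATA_SERVICE_AUTH_ENABLED=true
```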

Step 3: Pipeline Execution (lines 900-903)

pipeline = Pipeline.create(recipe)
pipeline.run()
ret = pipeline.pretty_print_summary()
sys.exit(ret)

Creates and runs the ingestion pipeline using the standard Pipeline class. After completion, prints a summary of ingested entities and exits with the appropriate return code.

I/O Contract

Input

  • Docker container status (from the Docker daemon): health check via check_docker_quickstart()
  • Authentication token (from the --token CLI option): optional; used for authenticated DataHub instances

Output

  • Metadata entities (to the GMS REST API at http://localhost:8080): demo datasets, dashboards, users, tags, etc.
  • Ingestion summary (to stdout): pipeline execution statistics
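As a quick sanity check on the output destination, you can probe the GMS endpoint before or after ingestion. This is a hedged sketch using only the standard library; the /config path is an assumption based on common DataHub CLI behavior, so verify it against your deployment.

```python
# Probe the datahub-rest sink target (GMS) over HTTP.
# The /config path is an assumption; adjust for your deployment.
import urllib.error
import urllib.request


def gms_reachable(server: str = "http://localhost:8080", timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(f"{server}/config", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False
```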

Dependencies

  • check_docker_quickstart (datahub.cli.docker_check): pre-ingestion health check
  • Pipeline (datahub.ingestion.run.pipeline): ingestion pipeline framework
  • upgrade.check_upgrade (datahub.upgrade): CLI upgrade-check decorator

Source Registration

The demo-data source is registered as a plugin entry point in setup.py (line 1036):

"datahub.ingestion.source.plugins": [
    # ...
    "demo-data = datahub.ingestion.source.demo_data.DemoDataSource",
    # ...
]

Usage Examples

# Load sample data (default, no authentication)
datahub docker ingest-sample-data

# Load sample data with authentication token
datahub docker ingest-sample-data --token eyJhbGciOiJIUzI1NiJ9...

# Programmatic equivalent
from datahub.ingestion.run.pipeline import Pipeline

recipe = {
    "source": {"type": "demo-data", "config": {}},
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}
pipeline = Pipeline.create(recipe)
pipeline.run()
pipeline.pretty_print_summary()
