Implementation: Datahub project Datahub Docker CLI Ingest Sample Data
| Field | Value |
|---|---|
| Implementation Name | Docker CLI Ingest Sample Data |
| Namespace | Datahub_project_Datahub |
| Workflow | Docker_Quickstart_Deployment |
| Type | API Doc |
| Language | Python |
| Last Updated | 2026-02-10 |
| Source Repository | datahub-project/datahub |
| Source File | metadata-ingestion/src/datahub/cli/docker_cli.py, lines 864-903 |
| Domains | Deployment, Docker, Metadata_Management |
Overview
The ingest_sample_data() function populates a running DataHub instance with demonstration metadata using the built-in demo-data ingestion source and the datahub-rest sink.
Function Signature
```python
@docker.command()
@click.option(
    "--token",
    type=str,
    is_flag=False,
    default=None,
    help="The token to be used when ingesting, used when datahub is deployed with METADATA_SERVICE_AUTH_ENABLED=true",
)
@upgrade.check_upgrade
def ingest_sample_data(token: Optional[str]) -> None:
```
CLI Usage
```shell
datahub docker ingest-sample-data [OPTIONS]
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--token` | str | None | Authentication token for DataHub instances with METADATA_SERVICE_AUTH_ENABLED=true. |
Return Value
Returns None. Exits via sys.exit(ret) where ret is the return code from pipeline.pretty_print_summary() (0 for success, 1 for failure).
Implementation Details
Step 1: Docker Health Check (lines 877-882)
```python
status = check_docker_quickstart()
if not status.is_ok():
    raise status.to_exception(
        header="Docker is not ready:",
        footer="Try running `datahub docker quickstart` first.",
    )
```
Verifies all DataHub containers are healthy before attempting ingestion. Raises QuickstartError with actionable guidance if containers are unhealthy.
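The gate above can be sketched in isolation. This is a hedged illustration, not DataHub code: `QuickstartStatus` and `ensure_docker_ready` are stand-ins for the real `check_docker_quickstart()` machinery, which actually inspects the Docker daemon; here a stub simply carries a list of error strings.

```python
# Illustrative stand-in for the status object returned by the real
# check_docker_quickstart(); it only carries a list of error strings.
class QuickstartStatus:
    def __init__(self, errors: list) -> None:
        self.errors = errors

    def is_ok(self) -> bool:
        return not self.errors

    def to_exception(self, header: str, footer: str) -> Exception:
        # Bundle the errors with actionable guidance, as the CLI does.
        return RuntimeError("\n".join([header, *self.errors, footer]))


def ensure_docker_ready(status: QuickstartStatus) -> None:
    # Mirrors the Step 1 gate: fail fast with guidance when unhealthy.
    if not status.is_ok():
        raise status.to_exception(
            header="Docker is not ready:",
            footer="Try running `datahub docker quickstart` first.",
        )


ensure_docker_ready(QuickstartStatus(errors=[]))  # healthy: returns silently
```

Failing fast here avoids starting a pipeline that would only time out against unreachable containers.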
Step 2: Pipeline Construction (lines 886-898)
```python
recipe: dict = {
    "source": {
        "type": "demo-data",
        "config": {},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}
if token is not None:
    recipe["sink"]["config"]["token"] = token
```
Constructs a minimal ingestion recipe with:
- Source: `demo-data`, the built-in demo data generator, registered in `setup.py` entry points as `demo-data = datahub.ingestion.source.demo_data.DemoDataSource`
- Sink: `datahub-rest`, the REST emitter targeting `http://localhost:8080` (the GMS service)
Step 3: Pipeline Execution (lines 900-903)
```python
pipeline = Pipeline.create(recipe)
pipeline.run()
ret = pipeline.pretty_print_summary()
sys.exit(ret)
```
Creates and runs the ingestion pipeline using the standard Pipeline class. After completion, prints a summary of ingested entities and exits with the appropriate return code.
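The run-and-exit flow can be sketched with a stub. This is a hedged illustration: `StubPipeline` and `run_and_exit` are stand-ins invented for this example, not part of the DataHub API; only the shape of the flow (run, summarize, exit with the summary's return code) mirrors the real function.

```python
import sys


class StubPipeline:
    """Illustrative stand-in for datahub's Pipeline; does no real ingestion."""

    def __init__(self, failures: int = 0) -> None:
        self.failures = failures

    def run(self) -> None:
        pass  # the real pipeline ingests demo metadata here

    def pretty_print_summary(self) -> int:
        # Mirrors the documented contract: 0 for success, 1 for failure.
        return 1 if self.failures else 0


def run_and_exit(pipeline: StubPipeline) -> None:
    # Same shape as the tail of ingest_sample_data().
    pipeline.run()
    ret = pipeline.pretty_print_summary()
    sys.exit(ret)


# Demo: a successful run exits with status 0.
try:
    run_and_exit(StubPipeline(failures=0))
except SystemExit as exc:
    print("exit code:", exc.code)  # → exit code: 0
```

Propagating the summary's return code through `sys.exit` lets shell scripts and CI jobs detect ingestion failures directly from the process exit status.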
I/O Contract
Input
| Input | Source | Description |
|---|---|---|
| Docker container status | Docker daemon | Health check via check_docker_quickstart() |
| Authentication token | CLI argument | Optional, for authenticated DataHub instances |
Output
| Output | Destination | Description |
|---|---|---|
| Metadata entities | GMS REST API at http://localhost:8080 | Demo datasets, dashboards, users, tags, etc. |
| Ingestion summary | stdout | Pipeline execution statistics |
Dependencies
| Dependency | Import Path | Purpose |
|---|---|---|
| `check_docker_quickstart` | `datahub.cli.docker_check` | Pre-ingestion health check |
| `Pipeline` | `datahub.ingestion.run.pipeline` | Ingestion pipeline framework |
| `upgrade.check_upgrade` | `datahub.upgrade` | CLI upgrade check decorator |
Source Registration
The demo-data source is registered as a plugin entry point in setup.py (line 1036):
```python
"datahub.ingestion.source.plugins": [
    # ...
    "demo-data = datahub.ingestion.source.demo_data.DemoDataSource",
    # ...
]
```
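The registration string maps a source type name to the class that implements it. As a hedged illustration of that mapping (not DataHub code: `parse_entry_point` is a helper invented for this example), the string can be split into its name, module path, and class name:

```python
def parse_entry_point(spec: str) -> tuple:
    """Split a 'name = dotted.path.ClassName' registration string.

    Illustrative helper only; real plugin resolution goes through
    setuptools entry points, not manual string parsing.
    """
    name, _, target = (part.strip() for part in spec.partition("="))
    module_path, _, class_name = target.rpartition(".")
    return name, module_path, class_name


spec = "demo-data = datahub.ingestion.source.demo_data.DemoDataSource"
print(parse_entry_point(spec))
# → ('demo-data', 'datahub.ingestion.source.demo_data', 'DemoDataSource')
```

This is why `"type": "demo-data"` in the recipe resolves to `DemoDataSource` at pipeline creation time.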
Usage Examples
```shell
# Load sample data (default, no authentication)
datahub docker ingest-sample-data

# Load sample data with authentication token
datahub docker ingest-sample-data --token eyJhbGciOiJIUzI1NiJ9...
```
```python
# Programmatic equivalent
from datahub.ingestion.run.pipeline import Pipeline

recipe = {
    "source": {"type": "demo-data", "config": {}},
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}
pipeline = Pipeline.create(recipe)
pipeline.run()
pipeline.pretty_print_summary()
```