Principle:Datahub project Datahub Sample Data Loading

Field	Value
Principle Name	Sample Data Loading
Namespace	Datahub_project_Datahub
Workflow	Docker_Quickstart_Deployment
Type	Principle
Last Updated	2026-02-10
Source Repository	datahub-project/datahub
Domains	Deployment, Docker, Metadata_Management

Overview

Sample data loading is the process of populating a fresh DataHub instance with demonstration metadata for evaluation and testing. It uses the built-in demo-data source type to generate representative metadata (datasets, dashboards, users, tags) and ingests it via the REST sink.

Description

A freshly deployed local DataHub instance starts with an empty metadata store. While functional, an empty instance makes it difficult to explore DataHub's features such as search, lineage visualization, data profiling, and governance workflows. Sample data loading addresses this by populating the instance with a curated set of demonstration metadata.

How It Works

The sample data loading mechanism reuses DataHub's own ingestion framework. Rather than requiring an external data source, it uses a special built-in source type called demo-data that generates synthetic metadata programmatically. This metadata is then ingested through the standard datahub-rest sink, which communicates with the GMS backend via the REST API at http://localhost:8080.

The ingestion pipeline is constructed with a minimal recipe:

Source: demo-data -- Generates representative metadata entities
Sink: datahub-rest -- Sends metadata to GMS via REST API
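The recipe above can be sketched as a plain Python dictionary and run through the ingestion framework's programmatic API. This is a minimal sketch, not the CLI's actual code: the helper names build_sample_data_recipe and run_recipe are hypothetical, the recipe layout follows the usual source/sink recipe structure, and run_recipe assumes the acryl-datahub package is installed and GMS is reachable at the given server.

```python
from typing import Any, Dict, Optional


def build_sample_data_recipe(
    server: str = "http://localhost:8080",
    token: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a minimal recipe: demo-data source -> datahub-rest sink."""
    sink_config: Dict[str, Any] = {"server": server}
    if token is not None:
        # Assumption: a token is only needed when GMS requires authentication.
        sink_config["token"] = token
    return {
        "source": {"type": "demo-data", "config": {}},
        "sink": {"type": "datahub-rest", "config": sink_config},
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe; needs the acryl-datahub package and a running GMS."""
    # Imported lazily so the recipe builder works without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()  # raise if the run reported failures
```

Keeping the recipe builder separate from the runner makes the recipe easy to inspect or log before anything is sent to GMS.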

What Gets Loaded

The demo data source generates a variety of metadata entities designed to showcase DataHub's capabilities:

Datasets -- Tables and views from various platforms (e.g., MySQL, Snowflake, Kafka)
Dashboards and Charts -- Visualization metadata with lineage to underlying datasets
Users and Groups -- Example corporate users and organizational groups
Tags and Glossary Terms -- Governance metadata for categorization and classification
Lineage Relationships -- Data flow connections between entities

Pre-Ingestion Health Check

Before running ingestion, the system verifies that the Docker quickstart stack is healthy by calling check_docker_quickstart(). If any containers are unhealthy, the user receives a clear error message suggesting they run datahub docker quickstart first.
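The gating behavior described above can be illustrated with a small stand-alone sketch. It does not reproduce check_docker_quickstart(), whose internals are not covered here; the function ensure_quickstart_healthy, the error type, and the status-mapping input are all hypothetical and only show the "fail early with an actionable message" idea.

```python
from typing import Dict, List


class QuickstartUnhealthyError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


def ensure_quickstart_healthy(container_status: Dict[str, str]) -> None:
    """Raise if any container is not reported as 'healthy'.

    container_status maps a container name to a status string; both the
    names and the status vocabulary are assumptions of this sketch.
    """
    unhealthy: List[str] = sorted(
        name for name, status in container_status.items() if status != "healthy"
    )
    if unhealthy:
        raise QuickstartUnhealthyError(
            "Unhealthy containers: "
            + ", ".join(unhealthy)
            + ". Try running `datahub docker quickstart` first."
        )
```

Calling the check before building the pipeline means a stopped or still-starting stack is reported up front rather than surfacing later as a connection error from the sink.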

Usage

Run the command after deploying DataHub locally to populate the instance with example metadata for evaluation.

# Load sample data (requires running DataHub instance)
datahub docker ingest-sample-data

# Load sample data with authentication token
datahub docker ingest-sample-data --token <your-token>

Typical scenarios:

First-time evaluation -- Exploring DataHub features with representative data
Demo preparation -- Setting up a DataHub instance for demonstrations
Development testing -- Having a populated instance for testing UI or API changes
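For demo preparation or test setup, the commands above can also be driven from a script. The following is a minimal sketch assuming the datahub CLI is on PATH; the helper names ingest_sample_data_command and load_sample_data are hypothetical, and only the command and the --token option shown above are used.

```python
import subprocess
from typing import List, Optional


def ingest_sample_data_command(token: Optional[str] = None) -> List[str]:
    """Build the argv for `datahub docker ingest-sample-data`."""
    command = ["datahub", "docker", "ingest-sample-data"]
    if token is not None:
        command += ["--token", token]
    return command


def load_sample_data(token: Optional[str] = None) -> None:
    """Run the CLI command; requires the datahub CLI and a running instance."""
    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(ingest_sample_data_command(token), check=True)
```

Passing the arguments as a list avoids shell quoting issues if the token contains special characters.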

Theoretical Basis

This principle follows the seed data pattern: load representative sample data to demonstrate system capabilities without requiring external data sources. The pattern is common in application onboarding, where an empty state provides a poor user experience.

The approach leverages DataHub's own ingestion framework (dogfooding), which validates the ingestion pipeline while providing useful demonstration data. Using the same REST API that production ingestion uses ensures the demo data is fully functional and exercises the same code paths.

Knowledge Sources

DataHub GitHub Repository
DataHub Official Documentation
Source file: metadata-ingestion/src/datahub/cli/docker_cli.py

Related Pages

Implemented by: Datahub_project_Datahub_Docker_CLI_Ingest_Sample_Data
