Principle:Datahub project Datahub Sample Data Loading

From Leeroopedia


Field               Value
Principle Name      Sample Data Loading
Namespace           Datahub_project_Datahub
Workflow            Docker_Quickstart_Deployment
Type                Principle
Last Updated        2026-02-10
Source Repository   datahub-project/datahub
Domains             Deployment, Docker, Metadata_Management

Overview

Sample data loading is the process of populating a fresh DataHub instance with demonstration metadata for evaluation and testing. It uses the built-in demo-data source type to generate representative metadata (datasets, dashboards, users, tags) and ingests it through the datahub-rest sink.

Description

After deploying DataHub locally, the instance starts with an empty metadata store. While functional, an empty instance makes it difficult to explore DataHub's features such as search, lineage visualization, data profiling, and governance workflows. Sample data loading addresses this by populating the instance with a curated set of demonstration metadata.

How It Works

The sample data loading mechanism reuses DataHub's own ingestion framework. Rather than requiring an external data source, it uses a special built-in source type called demo-data that generates synthetic metadata programmatically. This metadata is then ingested through the standard datahub-rest sink, which communicates with the GMS backend via the REST API at http://localhost:8080.

The ingestion pipeline is constructed with a minimal recipe:

  • Source: demo-data -- Generates representative metadata entities
  • Sink: datahub-rest -- Sends metadata to GMS via REST API
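The two-part recipe above can be written as a plain Python dictionary. The following is a minimal sketch, assuming the demo-data source needs no configuration and that the datahub-rest sink accepts a `server` key and an optional `token` key. `build_sample_data_recipe` and `run_recipe` are hypothetical helper names used for illustration; they are not part of DataHub.

```python
from typing import Any, Dict, Optional

GMS_URL = "http://localhost:8080"  # GMS REST endpoint named in the text


def build_sample_data_recipe(
    token: Optional[str] = None, server: str = GMS_URL
) -> Dict[str, Any]:
    """Build a minimal recipe: demo-data source, datahub-rest sink."""
    recipe: Dict[str, Any] = {
        "source": {"type": "demo-data", "config": {}},
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }
    # Only attach a token when one is supplied (assumed sink option).
    if token is not None:
        recipe["sink"]["config"]["token"] = token
    return recipe


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe; needs the DataHub Python package and a running instance."""
    # Imported lazily so that building a recipe works without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Keeping recipe construction separate from execution makes the recipe easy to inspect or print before anything is sent to GMS.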

What Gets Loaded

The demo data source generates a variety of metadata entities designed to showcase DataHub's capabilities:

  • Datasets -- Tables and views from various platforms (e.g., MySQL, Snowflake, Kafka)
  • Dashboards and Charts -- Visualization metadata with lineage to underlying datasets
  • Users and Groups -- Example corporate users and organizational groups
  • Tags and Glossary Terms -- Governance metadata for categorization and classification
  • Lineage Relationships -- Data flow connections between entities

Pre-Ingestion Health Check

Before running ingestion, the system verifies that the Docker quickstart stack is healthy by calling check_docker_quickstart(). If any containers are unhealthy, the user receives a clear error message suggesting they run datahub docker quickstart first.
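The fail-fast behavior described above can be sketched in isolation. This example does not call the real check_docker_quickstart(); `StackStatus`, `StackNotReadyError`, `ensure_stack_ready`, and the container names are hypothetical stand-ins chosen only to show the gating pattern.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StackStatus:
    """Hypothetical stand-in for the result of a quickstart health check."""

    container_health: Dict[str, str]  # container name -> reported state

    def unhealthy(self) -> List[str]:
        # Any state other than "healthy" counts as not ready.
        return sorted(
            name for name, state in self.container_health.items() if state != "healthy"
        )

    def is_ok(self) -> bool:
        return not self.unhealthy()


class StackNotReadyError(RuntimeError):
    """Raised when ingestion should not start yet."""


def ensure_stack_ready(status: StackStatus) -> None:
    """Stop with an actionable message instead of starting ingestion."""
    if status.is_ok():
        return
    names = ", ".join(status.unhealthy())
    raise StackNotReadyError(
        f"Docker is not ready (unhealthy: {names}). "
        "Try running `datahub docker quickstart` first."
    )
```

Calling `ensure_stack_ready` before building the pipeline means a stopped or half-started stack produces one clear message rather than a connection error from the sink.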

Usage

Use this after deploying DataHub locally, to populate the instance with example metadata for evaluation.

# Load sample data (requires running DataHub instance)
datahub docker ingest-sample-data

# Load sample data with authentication token
datahub docker ingest-sample-data --token <your-token>
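For scripted setups, the same two commands can be driven from Python. This is a minimal sketch that assumes the datahub CLI is on PATH; `build_ingest_command` and `load_sample_data` are hypothetical helper names, and only the subcommand and the --token option shown above are used.

```python
import subprocess
from typing import List, Optional


def build_ingest_command(token: Optional[str] = None) -> List[str]:
    """Return the argv for `datahub docker ingest-sample-data`."""
    cmd = ["datahub", "docker", "ingest-sample-data"]
    if token:
        cmd += ["--token", token]
    return cmd


def load_sample_data(token: Optional[str] = None) -> int:
    """Run the CLI and return its exit code (requires a running DataHub instance)."""
    # Passing a list avoids shell interpolation of the token value.
    return subprocess.run(build_ingest_command(token), check=False).returncode
```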

Typical scenarios:

  • First-time evaluation -- Exploring DataHub features with representative data
  • Demo preparation -- Setting up a DataHub instance for demonstrations
  • Development testing -- Having a populated instance for testing UI or API changes

Theoretical Basis

This principle follows the seed data pattern -- load representative sample data to demonstrate system capabilities without requiring external data sources. The pattern is common in application onboarding, where an empty state gives new users little to explore.

The approach leverages DataHub's own ingestion framework (dogfooding), which validates the ingestion pipeline while providing useful demonstration data. Using the same REST API that production ingestion uses ensures the demo data is fully functional and exercises the same code paths.

Related Pages

Implementation:Datahub_project_Datahub_Docker_CLI_Ingest_Sample_Data
