Implementation:PacktPublishing LLM Engineers Handbook Data Warehouse CLI

From Leeroopedia


Knowledge Sources
Domains Data_Management, CLI, Infrastructure
Last Updated 2026-02-08 08:00 GMT

Overview

Concrete tool for exporting and importing MongoDB data warehouse contents to and from JSON files via a Click CLI.

Description

The data_warehouse.py CLI tool provides bidirectional data portability for the project's MongoDB data warehouse. On export, it calls bulk_find() on each document class (ArticleDocument, PostDocument, RepositoryDocument, UserDocument), serializes the results via to_mongo(), and writes one JSON file per class, named after the class (for example, ArticleDocument.json). On import, it reads the JSON files in a directory, matches each filename to a document class via a lookup dictionary, deserializes the records with from_mongo(), and calls bulk_insert() to load them into MongoDB. The CLI is built with Click and exposes three options: --export-raw-data, --import-raw-data, and --data-dir.
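
The export and import flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: FakeDocument is a hypothetical in-memory stand-in for the real document classes so the example runs without MongoDB, the helper names export_all, import_all, and round_trip_demo are invented here, and the exact signatures of bulk_find(), to_mongo(), from_mongo(), and bulk_insert() in the project may differ.

```python
import json
import tempfile
from pathlib import Path


class FakeDocument:
    """Hypothetical in-memory stand-in for a MongoDB-backed document class."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._store = []  # each subclass gets its own "collection"

    def __init__(self, fields: dict):
        self.fields = dict(fields)

    def to_mongo(self) -> dict:
        return dict(self.fields)

    @classmethod
    def from_mongo(cls, data: dict) -> "FakeDocument":
        return cls(data)

    @classmethod
    def bulk_find(cls) -> list:
        return list(cls._store)

    @classmethod
    def bulk_insert(cls, documents: list) -> bool:
        cls._store.extend(documents)
        return True


class ArticleDocument(FakeDocument):
    pass


class UserDocument(FakeDocument):
    pass


# Lookup dictionary: file stem -> document class (assumed shape).
DOCUMENT_CLASSES = {cls.__name__: cls for cls in (ArticleDocument, UserDocument)}


def export_all(data_dir: Path) -> None:
    """Write one JSON file per document class, named after the class."""
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, cls in DOCUMENT_CLASSES.items():
        records = [doc.to_mongo() for doc in cls.bulk_find()]
        (data_dir / f"{name}.json").write_text(json.dumps(records))


def import_all(data_dir: Path) -> None:
    """Load every JSON file whose name matches a known document class."""
    for path in sorted(data_dir.glob("*.json")):
        cls = DOCUMENT_CLASSES.get(path.stem)
        if cls is None:
            continue  # skip files that do not match a known class
        records = json.loads(path.read_text())
        cls.bulk_insert([cls.from_mongo(record) for record in records])


def round_trip_demo() -> list:
    """Export, wipe the fake collection, re-import, and return what came back."""
    ArticleDocument.bulk_insert([ArticleDocument({"title": "hello"})])
    with tempfile.TemporaryDirectory() as tmp:
        export_all(Path(tmp))
        ArticleDocument._store.clear()  # simulate a fresh database
        import_all(Path(tmp))
    return [doc.to_mongo() for doc in ArticleDocument.bulk_find()]
```

Keying the lookup dictionary on the class name is what lets the import step recover the target class from the filename alone, which matches the naming scheme of the exported files.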

Usage

Use this tool to back up the data warehouse before destructive operations, to share raw crawled data with collaborators without requiring direct database access, or to seed a fresh MongoDB instance with previously collected data. This is the primary mechanism for data warehouse portability in the project.

Code Reference

Source Location

Signature

@click.command()
@click.option("--export-raw-data", is_flag=True, default=False,
              help="Whether to export your data warehouse to a JSON file.")
@click.option("--import-raw-data", is_flag=True, default=False,
              help="Whether to import a JSON file into your data warehouse.")
@click.option("--data-dir", default=Path("data/data_warehouse_raw_data"), type=Path,
              help="Path to the directory containing data warehouse raw data JSON files.")
def main(export_raw_data: bool, import_raw_data: bool, data_dir: Path) -> None:
    """CLI entry point for data warehouse export/import operations."""

Import

# Typically invoked as a script, not imported
# python tools/data_warehouse.py --export-raw-data

I/O Contract

Inputs

Name               Type  Required  Description
--export-raw-data  flag  No*       Export all document collections to JSON files
--import-raw-data  flag  No*       Import JSON files into MongoDB collections
--data-dir         Path  No        Directory for JSON files (default: data/data_warehouse_raw_data)

*At least one of --export-raw-data or --import-raw-data must be specified.
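
The "at least one flag" rule can be expressed as a small check. The sketch below is an assumption about how such a rule might be written, not the tool's actual code: select_operations is a hypothetical helper, the ValueError is chosen for illustration, and running export before import when both flags are set is also an assumption.

```python
def select_operations(export_raw_data: bool, import_raw_data: bool) -> list[str]:
    """Return the operations to run, rejecting a call with neither flag set."""
    if not (export_raw_data or import_raw_data):
        raise ValueError(
            "Specify at least one of --export-raw-data or --import-raw-data."
        )
    operations = []
    if export_raw_data:
        operations.append("export")  # assumed to run first when both are set
    if import_raw_data:
        operations.append("import")
    return operations
```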

Implicit Requirements:

  • MongoDB must be running and accessible via settings.DATABASE_HOST
  • Document classes must be importable from llm_engineering.domain.documents

Outputs

Name                     Type            Description
ArticleDocument.json     File            Serialized article documents (on export)
PostDocument.json        File            Serialized post documents (on export)
RepositoryDocument.json  File            Serialized repository documents (on export)
UserDocument.json        File            Serialized user documents (on export)
MongoDB collections      DB side effect  Populated collections (on import)
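
After an export, it can be useful to confirm that all four files listed above are present before sharing or archiving the directory. The helper below is a hypothetical convenience, not part of the tool; it checks only that the files exist and makes no claim about their contents.

```python
from pathlib import Path

# File names taken from the Outputs table above.
EXPECTED_FILES = (
    "ArticleDocument.json",
    "PostDocument.json",
    "RepositoryDocument.json",
    "UserDocument.json",
)


def missing_exports(data_dir: Path) -> list[str]:
    """Return the expected export files that are absent from data_dir."""
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
```

An empty list means every expected file is in place; for example, missing_exports(Path("data/data_warehouse_raw_data")) checks the default export directory.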

Usage Examples

Exporting Data Warehouse

# Export all collections to default directory
python tools/data_warehouse.py --export-raw-data

# Export to a custom directory
python tools/data_warehouse.py --export-raw-data --data-dir ./backup/2024-01-15

Importing Data Warehouse

# Import from default directory
python tools/data_warehouse.py --import-raw-data

# Import from a custom directory
python tools/data_warehouse.py --import-raw-data --data-dir ./shared_data

Full Round-Trip

# Export current state
python tools/data_warehouse.py --export-raw-data --data-dir ./backup

# Later, restore from backup
python tools/data_warehouse.py --import-raw-data --data-dir ./backup
