Implementation:PacktPublishing LLM Engineers Handbook Data Warehouse CLI
| Knowledge Sources | |
|---|---|
| Domains | Data_Management, CLI, Infrastructure |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
Concrete tool for exporting and importing MongoDB data warehouse contents to and from JSON files via a Click CLI.
Description
The data_warehouse.py CLI tool provides bidirectional data portability for the project's MongoDB data warehouse. On export, it calls bulk_find() on each document class (ArticleDocument, PostDocument, RepositoryDocument, UserDocument), serializes the results via to_mongo(), and writes one JSON file per class, named after the class (for example, ArticleDocument.json). On import, it reads the JSON files from a directory, matches each filename to a document class via a lookup dictionary, deserializes the records with from_mongo(), and calls bulk_insert() to load them into MongoDB. The CLI is built with Click and exposes three options: --export-raw-data, --import-raw-data, and --data-dir.
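The export and import paths described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in tools/data_warehouse.py: `FakeDocument` and its per-class `_store` list are hypothetical stand-ins for the MongoDB-backed document classes, the record fields (`_id`, `content`) are invented for the example, and the method signatures are assumptions modeled only on the names mentioned above. Skipping files with no matching class is likewise a choice made for this sketch.

```python
import json
import tempfile
from pathlib import Path


class FakeDocument:
    """Hypothetical stand-in for a MongoDB-backed document class."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._store = []  # in-memory replacement for a MongoDB collection

    def __init__(self, id: str, content: str) -> None:
        self.id = id
        self.content = content

    def to_mongo(self) -> dict:
        # Serialize to a plain dict (field names are assumptions).
        return {"_id": self.id, "content": self.content}

    @classmethod
    def from_mongo(cls, data: dict) -> "FakeDocument":
        return cls(id=data["_id"], content=data["content"])

    @classmethod
    def bulk_find(cls) -> list:
        return list(cls._store)

    @classmethod
    def bulk_insert(cls, documents: list) -> bool:
        cls._store.extend(documents)
        return True


# Stand-ins that reuse two of the real class names so the file names match.
class ArticleDocument(FakeDocument):
    pass


class PostDocument(FakeDocument):
    pass


# Lookup dictionary: file stem -> document class.
DOCUMENT_CLASSES = {cls.__name__: cls for cls in (ArticleDocument, PostDocument)}


def export_raw_data(data_dir: Path) -> None:
    """Write one JSON file per document class, named after the class."""
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, cls in DOCUMENT_CLASSES.items():
        records = [doc.to_mongo() for doc in cls.bulk_find()]
        (data_dir / f"{name}.json").write_text(json.dumps(records), encoding="utf-8")


def import_raw_data(data_dir: Path) -> None:
    """Load every JSON file whose stem matches a known document class."""
    for path in sorted(data_dir.glob("*.json")):
        cls = DOCUMENT_CLASSES.get(path.stem)
        if cls is None:
            continue  # no matching class; skipped in this sketch
        records = json.loads(path.read_text(encoding="utf-8"))
        cls.bulk_insert([cls.from_mongo(record) for record in records])


def round_trip_demo() -> dict:
    """Export two tiny collections, empty the stores, then import them back."""
    ArticleDocument._store = [ArticleDocument("a1", "hello")]
    PostDocument._store = [PostDocument("p1", "world")]
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        export_raw_data(data_dir)
        files = sorted(p.name for p in data_dir.glob("*.json"))
        ArticleDocument._store = []  # simulate a fresh database
        PostDocument._store = []
        import_raw_data(data_dir)
    return {
        "files": files,
        "articles": [doc.to_mongo() for doc in ArticleDocument.bulk_find()],
        "posts": [doc.to_mongo() for doc in PostDocument.bulk_find()],
    }
```

Keying the lookup dictionary on the class name keeps the export file name and the import match in sync, so adding a document class to the tuple is enough to cover both directions in this sketch.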
Usage
Use this tool to back up the data warehouse before destructive operations, to share raw crawled data with collaborators without requiring direct database access, or to seed a fresh MongoDB instance with previously collected data. This is the primary mechanism for data warehouse portability in the project.
Code Reference
Source Location
- Repository: PacktPublishing_LLM_Engineers_Handbook
- File: tools/data_warehouse.py
- Lines: 1-99
Signature
@click.command()
@click.option("--export-raw-data", is_flag=True, default=False,
              help="Whether to export your data warehouse to a JSON file.")
@click.option("--import-raw-data", is_flag=True, default=False,
              help="Whether to import a JSON file into your data warehouse.")
@click.option("--data-dir", default=Path("data/data_warehouse_raw_data"), type=Path,
              help="Path to the directory containing data warehouse raw data JSON files.")
def main(export_raw_data: bool, import_raw_data: bool, data_dir: Path) -> None:
    """CLI entry point for data warehouse export/import operations."""
Import
# Typically invoked as a script, not imported
# python tools/data_warehouse.py --export-raw-data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --export-raw-data | flag | No* | Export all document collections to JSON files |
| --import-raw-data | flag | No* | Import JSON files into MongoDB collections |
| --data-dir | Path | No | Directory for JSON files (default: data/data_warehouse_raw_data) |
*At least one of --export-raw-data or --import-raw-data must be specified.
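One way the "at least one flag" rule could be enforced in a Click command is sketched below. How tools/data_warehouse.py actually performs this check is not shown on this page, so the `click.UsageError` approach, the error message, and the "would export/import" placeholders are assumptions made for illustration; only the option names and the default directory come from the signature above.

```python
from pathlib import Path

import click


@click.command()
@click.option("--export-raw-data", is_flag=True, default=False)
@click.option("--import-raw-data", is_flag=True, default=False)
@click.option("--data-dir", default=Path("data/data_warehouse_raw_data"), type=Path)
def main(export_raw_data: bool, import_raw_data: bool, data_dir: Path) -> None:
    """Illustrative entry point that only validates flags and reports intent."""
    # Reject invocations that request neither operation (assumed behavior).
    if not (export_raw_data or import_raw_data):
        raise click.UsageError("Specify --export-raw-data and/or --import-raw-data.")

    # Placeholders for the real export/import logic.
    if export_raw_data:
        click.echo(f"would export to {data_dir}")
    if import_raw_data:
        click.echo(f"would import from {data_dir}")


if __name__ == "__main__":
    main()
```

Raising `click.UsageError` makes Click print the message together with the command's usage line and exit with a non-zero status, which is friendlier at the command line than an uncaught exception.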
Implicit Requirements:
- MongoDB must be running and accessible via settings.DATABASE_HOST
- Document classes must be importable from llm_engineering.domain.documents
Outputs
| Name | Type | Description |
|---|---|---|
| ArticleDocument.json | File | Serialized article documents (on export) |
| PostDocument.json | File | Serialized post documents (on export) |
| RepositoryDocument.json | File | Serialized repository documents (on export) |
| UserDocument.json | File | Serialized user documents (on export) |
| MongoDB collections | DB side effect | Populated collections (on import) |
Usage Examples
Exporting Data Warehouse
# Export all collections to default directory
python tools/data_warehouse.py --export-raw-data
# Export to a custom directory
python tools/data_warehouse.py --export-raw-data --data-dir ./backup/2024-01-15
Importing Data Warehouse
# Import from default directory
python tools/data_warehouse.py --import-raw-data
# Import from a custom directory
python tools/data_warehouse.py --import-raw-data --data-dir ./shared_data
Full Round-Trip
# Export current state
python tools/data_warehouse.py --export-raw-data --data-dir ./backup
# Later, restore from backup
python tools/data_warehouse.py --import-raw-data --data-dir ./backup
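Before restoring from a backup directory, it can be worth confirming that the expected files are present and parse as JSON. The helper below is a hypothetical sketch and is not part of the repository: `check_backup` and `EXPECTED_FILES` are names invented here, and the file list is taken from the Outputs table. It makes no assumption about the structure inside each file beyond it being valid JSON.

```python
import json
from pathlib import Path

# File names taken from the Outputs table; the helper itself is hypothetical.
EXPECTED_FILES = (
    "ArticleDocument.json",
    "PostDocument.json",
    "RepositoryDocument.json",
    "UserDocument.json",
)


def check_backup(data_dir: Path) -> dict[str, str]:
    """Return a status per expected file: 'ok', 'missing', or 'invalid JSON'."""
    report = {}
    for name in EXPECTED_FILES:
        path = data_dir / name
        if not path.is_file():
            report[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            report[name] = "invalid JSON"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    # Print one status line per expected file in the ./backup directory.
    for file_name, status in check_backup(Path("./backup")).items():
        print(f"{file_name}: {status}")
```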