Principle:PacktPublishing LLM Engineers Handbook Data Warehouse Query
| Concept | Querying a NoSQL data warehouse for raw documents by author |
|---|---|
| Workflow | Feature_Engineering |
| Pipeline Stage | Data Ingestion |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implemented By | Implementation:PacktPublishing_LLM_Engineers_Handbook_NoSQLBaseDocument_Bulk_Find |
Overview
Data Warehouse Query is a foundational pattern in ML and NLP pipelines for retrieving domain objects from a persistent store using filter criteria. In the context of feature engineering for LLM applications, it is the initial data ingestion step that feeds all downstream processing stages such as cleaning, chunking, embedding, and vector storage.
Theory
The Data Warehouse Query pattern follows the Repository pattern where a base document class provides query methods that abstract away the underlying database operations. Rather than coupling pipeline logic directly to MongoDB query syntax, the domain model exposes a clean, typed interface for retrieving documents.
Key characteristics of this pattern:
- Abstraction over storage — Callers interact with domain objects (e.g., `ArticleDocument`, `PostDocument`, `RepositoryDocument`) rather than raw database cursors or dictionaries.
- Filter-based retrieval — Documents are retrieved using keyword-argument filters (e.g., `author_id=uuid`), which map directly to MongoDB query predicates.
- Typed deserialization — Raw database records are automatically deserialized into strongly typed Pydantic models via a `from_mongo` class method, ensuring downstream code operates on validated, well-structured data.
- Bulk retrieval — The pattern supports fetching multiple documents in a single operation, which is essential for batch processing in ML pipelines.
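The characteristics above can be sketched in a few lines. This is an illustrative simplification, not the handbook's implementation: a plain dataclass stands in for the Pydantic model, and an in-memory list of dicts stands in for the MongoDB collection (the real `bulk_find` queries the collection configured on the class rather than taking it as an argument).

```python
import uuid
from typing import Any
from dataclasses import dataclass


@dataclass
class ArticleDocument:
    id: uuid.UUID
    author_id: uuid.UUID
    content: str

    @classmethod
    def from_mongo(cls, data: dict[str, Any]) -> "ArticleDocument":
        # Translate MongoDB's native _id into the UUID-based id field.
        data = dict(data)
        data["id"] = data.pop("_id")
        return cls(**data)

    @classmethod
    def bulk_find(cls, collection: list[dict], **filters: Any) -> list["ArticleDocument"]:
        # Keyword-argument filters map onto simple equality predicates,
        # mirroring how they would translate into a MongoDB query document.
        matches = [
            record
            for record in collection
            if all(record.get(key) == value for key, value in filters.items())
        ]
        return [cls.from_mongo(record) for record in matches]
```

Callers never see raw records: every document crossing the boundary is deserialized into a typed object before the pipeline touches it.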
How It Fits in Feature Engineering
In the PacktPublishing LLM Engineers Handbook, the feature engineering pipeline begins by querying a MongoDB data warehouse for all documents authored by a specific user. These raw documents — which may include articles, social media posts, and code repositories — are then passed through a series of transformation stages:
- Query (this pattern) — Retrieve raw documents from MongoDB by author
- Clean — Normalize and sanitize the raw text
- Chunk — Split cleaned documents into semantically coherent segments
- Embed — Generate dense vector representations of each chunk
- Store — Persist embedded chunks into a vector database
The query step is critical because it determines the scope and composition of the data that flows through the entire pipeline.
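The five stages above can be sketched as a chain of functions. The bodies here are deliberately toy stand-ins (whitespace normalization for cleaning, fixed-size splitting for chunking, a character-count feature for embedding); only the shape of the composition reflects the pipeline, and the function names are illustrative, not the handbook's.

```python
def clean(text: str) -> str:
    # Stand-in for normalization and sanitization: collapse whitespace.
    return " ".join(text.split())


def chunk(text: str, size: int = 50) -> list[str]:
    # Stand-in for semantic chunking: fixed-size character windows.
    return [text[i : i + size] for i in range(0, len(text), size)]


def embed(chunks: list[str]) -> list[list[float]]:
    # Stand-in for a real embedding model: one-dimensional length feature.
    return [[float(len(c))] for c in chunks]


def run_pipeline(raw_documents: list[str]) -> list[list[float]]:
    # Query output (raw documents) flows through clean -> chunk -> embed;
    # in the real pipeline the vectors would then be stored in a vector DB.
    vectors: list[list[float]] = []
    for doc in raw_documents:
        vectors.extend(embed(chunk(clean(doc))))
    return vectors
```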
Design Considerations
- Error handling — Failed queries should return empty lists rather than raising exceptions, allowing the pipeline to degrade gracefully when the data warehouse is temporarily unavailable.
- Collection mapping — Each document subclass maps to its own MongoDB collection via the `Settings.name` inner class, following a convention-over-configuration approach.
- Serialization round-trip — Documents are stored with MongoDB's native `_id` field and deserialized back into Pydantic models with a UUID-based `id` field, requiring a translation layer in `from_mongo`.
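The serialization round-trip and the graceful-failure behavior can be sketched as follows. `to_mongo` and `safe_bulk_find` are hypothetical names used for illustration; the source only states that the translation happens in `from_mongo` and that failed queries return empty lists.

```python
def to_mongo(doc: dict) -> dict:
    # On write: the UUID-based "id" becomes MongoDB's native "_id".
    out = dict(doc)
    out["_id"] = out.pop("id")
    return out


def from_mongo(record: dict) -> dict:
    # On read: reverse the translation so domain code sees "id" again.
    out = dict(record)
    out["id"] = out.pop("_id")
    return out


def safe_bulk_find(query_fn, **filters) -> list:
    # Degrade gracefully: an unavailable warehouse yields an empty list
    # instead of an exception that would abort the whole pipeline run.
    try:
        return query_fn(**filters)
    except Exception:
        return []
```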
Usage
Use this pattern when:
- Extracting raw documents from a MongoDB data warehouse as input to a feature engineering pipeline
- Building batch data ingestion steps that need to retrieve all documents matching a given filter
- Implementing the first stage of an ETL or ELT pipeline that transforms raw crawled data into ML-ready features
Example
```python
from llm_engineering.domain.documents import ArticleDocument

# Retrieve all articles by a specific author
articles = ArticleDocument.bulk_find(author_id=author_uuid)

# Each article is a fully typed Pydantic model
for article in articles:
    print(article.content[:100])
```