Implementation:PacktPublishing LLM Engineers Handbook NoSQLBaseDocument Bulk Find

Type	API Doc
API	`NoSQLBaseDocument.bulk_find(cls, **filter_options) -> list[T]`
Source	llm_engineering/domain/base/nosql.py:L122-130
Repository	PacktPublishing/LLM-Engineers-Handbook
Implements	Principle:PacktPublishing_LLM_Engineers_Handbook_Data_Warehouse_Query

Overview

The bulk_find class method on NoSQLBaseDocument retrieves multiple documents from a MongoDB collection based on arbitrary filter criteria. It is the primary data ingestion API used by the feature engineering pipeline to load raw documents (articles, posts, repositories) from the data warehouse.

API Signature

@classmethod
def bulk_find(cls, **filter_options) -> list:

Parameters

Parameter	Type	Description
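`**filter_options`	keyword arguments (collected into a `dict`)	Field-value pairs that are collected into a single dict and passed unchanged as the filter argument of `collection.find()`. For example, `author_id=uuid` retrieves all documents by a specific author. Calling the method with no arguments passes an empty filter, which matches every document in the collection.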
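
As a rough illustration of how the keyword arguments become a query filter, the sketch below uses a hypothetical helper, `as_filter`, that is not part of the repository. It only mirrors what Python does when `**filter_options` is collected into a dict. The field names `platform` and `content.Title` and their values are assumptions chosen for the example.

```python
def as_filter(**filter_options) -> dict:
    """Hypothetical helper: return the dict that would be handed to collection.find()."""
    return filter_options


# Plain keyword: equality match on a top-level field.
by_author = as_filter(author_id="550e8400-e29b-41d4-a716-446655440000")

# The value may itself be a MongoDB operator document.
by_platform = as_filter(platform={"$in": ["medium", "substack"]})

# Keys that are not valid Python identifiers (dotted paths, "$or", ...)
# can only be supplied by unpacking a dict.
by_nested = as_filter(**{"content.Title": "Example"})

print(by_author)
print(by_platform)
print(by_nested)
```

The same call shapes should carry over to `bulk_find` itself, since the method forwards the collected dict without modification.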

Return Value

Type	Description
`list[T]`	A list of deserialized document instances matching the filter. `T` is the concrete subclass of `NoSQLBaseDocument` (e.g., `ArticleDocument`, `PostDocument`, `RepositoryDocument`). Returns an empty list if no documents match or if an error occurs.

Source Code

@classmethod
def bulk_find(cls, **filter_options) -> list:
    collection = cls._get_collection()
    try:
        instances = collection.find(filter_options)
        return [cls.from_mongo(instance) for instance in instances]
    except Exception:
        logger.exception("Failed to retrieve documents.")
        return []

Import

from llm_engineering.domain.base.nosql import NoSQLBaseDocument

In practice, callers import concrete subclasses rather than the base class:

from llm_engineering.domain.documents import ArticleDocument, PostDocument, RepositoryDocument

How It Works

Collection resolution: cls._get_collection() returns the PyMongo collection object for the calling class. Each document subclass defines its collection name via an inner Settings class with a name attribute.
Query execution: The filter_options keyword arguments are collected into a dict and passed unchanged to PyMongo's collection.find(), which returns a cursor over matching MongoDB documents. The cursor is lazy, so documents are actually fetched while the list comprehension iterates over it, still inside the try block.
Deserialization: Each raw MongoDB document (a dictionary whose identifier is stored under _id) is converted into a typed Pydantic model via cls.from_mongo(instance). This method handles the _id to id field mapping.
Error handling: If any exception occurs during the query or deserialization, it is logged via loguru and an empty list is returned, so a database error does not crash the pipeline. Because the handler catches Exception broadly, non-transient failures such as a document that fails validation are swallowed in the same way.
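
The four steps above can be sketched without a database. This is a minimal, self-contained approximation and not the repository's code: `FakeCollection`, `BaseDoc`, `ArticleDoc`, `FAKE_DB`, and the sample rows are all hypothetical stand-ins, a dataclass replaces the Pydantic model, the standard `logging` module replaces loguru, and the fake `find()` supports equality filters only.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


class FakeCollection:
    """In-memory stand-in for a PyMongo collection; supports equality filters only."""

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def find(self, query: dict):
        # Generator: rows are matched lazily, loosely mimicking a cursor.
        for row in self._rows:
            if all(row.get(key) == value for key, value in query.items()):
                yield dict(row)


@dataclass
class BaseDoc:
    """Hypothetical stand-in for NoSQLBaseDocument (a dataclass, not Pydantic)."""

    id: str

    @classmethod
    def _get_collection(cls) -> FakeCollection:
        # Step 1: the subclass's inner Settings.name selects the collection.
        return FAKE_DB[cls.Settings.name]

    @classmethod
    def from_mongo(cls, data: dict):
        # Step 3: rename "_id" to "id" and build a typed instance.
        data = dict(data)
        return cls(id=data.pop("_id"), **data)

    @classmethod
    def bulk_find(cls, **filter_options) -> list:
        collection = cls._get_collection()
        try:
            # Step 2: the keyword arguments arrive as one dict.
            instances = collection.find(filter_options)
            return [cls.from_mongo(instance) for instance in instances]
        except Exception:
            # Step 4: log and fall back to an empty list.
            logger.exception("Failed to retrieve documents.")
            return []


@dataclass
class ArticleDoc(BaseDoc):
    author_id: str
    link: str

    class Settings:
        name = "articles"


FAKE_DB = {
    "articles": FakeCollection(
        [
            {"_id": "a1", "author_id": "u1", "link": "https://example.com/1"},
            {"_id": "a2", "author_id": "u2", "link": "https://example.com/2"},
            {"_id": "a3", "author_id": "u1", "link": "https://example.com/3"},
        ]
    )
}

articles = ArticleDoc.bulk_find(author_id="u1")
print([article.id for article in articles])
```

Because `bulk_find` is called on `ArticleDoc`, `cls` is bound to that subclass in every step, so the returned list holds `ArticleDoc` instances rather than instances of the base class.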

Usage Example

from llm_engineering.domain.documents import ArticleDocument

# Retrieve all articles by a specific author
author_uuid = "550e8400-e29b-41d4-a716-446655440000"
articles = ArticleDocument.bulk_find(author_id=author_uuid)

print(f"Found {len(articles)} articles")
for article in articles:
    # str() keeps the preview working whether content is stored as a string or a dict
    print(f"  - {article.id}: {str(article.content)[:80]}...")

External Dependencies

Dependency	Purpose
pymongo	MongoDB driver; provides `collection.find()` for executing queries
pydantic	Data validation and serialization; `NoSQLBaseDocument` extends Pydantic's `BaseModel`
loguru	Structured logging; used for exception reporting on query failure

Design Notes

The method is a classmethod, meaning it is called on the document subclass itself (e.g., ArticleDocument.bulk_find(...)), and the returned list contains instances of that specific subclass.
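The use of **filter_options lets callers express simple equality filters as keyword arguments instead of constructing a query dictionary by hand. Keys that are not valid Python identifiers, such as dotted field paths or top-level operators like $or, still have to be supplied by unpacking a dict (bulk_find(**{...})).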
The fail-safe return of an empty list on exception is a deliberate design choice that prioritizes pipeline resilience over strict error propagation.
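
One consequence of the fail-safe return is easy to miss: the caller receives the same value whether nothing matched or the query failed. The sketch below demonstrates this with two hypothetical stand-in collections and a free function, `bulk_find_from`, that copies only the try/except shape of the method. None of these names exist in the repository, and the standard `logging` module is used in place of loguru.

```python
import logging

logger = logging.getLogger(__name__)
# NullHandler keeps this demo's output clean; the real code would emit a traceback.
logger.addHandler(logging.NullHandler())


class EmptyCollection:
    """Stand-in collection in which no document matches."""

    def find(self, query: dict):
        return iter(())


class BrokenCollection:
    """Stand-in collection whose query always fails."""

    def find(self, query: dict):
        raise ConnectionError("simulated outage")


def bulk_find_from(collection, **filter_options) -> list:
    """Same try/except shape as bulk_find, with the collection passed in."""
    try:
        return [dict(doc) for doc in collection.find(filter_options)]
    except Exception:
        logger.exception("Failed to retrieve documents.")
        return []


no_match = bulk_find_from(EmptyCollection(), author_id="u1")
failed = bulk_find_from(BrokenCollection(), author_id="u1")

# Both calls return an empty list; only the log records the difference.
print(no_match == failed)
```

A caller that needs to tell the two cases apart has to rely on the log output, since the return value carries no error information.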
