Principle:PacktPublishing LLM Engineers Handbook Document Persistence

From Leeroopedia


Aspect          Detail
Concept         Object-Document Mapping (ODM) for NoSQL persistence
Workflow        Digital_Data_ETL
Pipeline Role   Data storage layer (cross-cutting concern used by all ETL steps)
Implemented By  Implementation:PacktPublishing_LLM_Engineers_Handbook_NoSQLBaseDocument_Save

Overview

Document Persistence is the principle of bridging object-oriented domain models with document-based NoSQL databases through an Object-Document Mapping (ODM) abstraction. In the Digital Data ETL pipeline, all extracted content (articles, posts, repositories, user profiles) must be persisted to MongoDB in a consistent, type-safe manner. The ODM layer provides automatic serialization/deserialization, model validation via Pydantic, and a uniform CRUD interface across all document types.

Theoretical Foundation

Object-Document Mapping (ODM)

ODM is the NoSQL counterpart to Object-Relational Mapping (ORM). While ORM maps objects to relational table rows, ODM maps objects to document store entries (JSON/BSON documents in MongoDB):

Concept        ORM (SQL)             ODM (NoSQL)
Data Unit      Table Row             Document (BSON)
Schema         Table Schema (DDL)    Pydantic Model
Identity       Primary Key           Document _id (UUID)
Relationships  Foreign Keys, JOINs   Embedded documents, references
Validation     Database constraints  Pydantic validators
Serialization  SQL parameterization  to_mongo() / from_mongo()

The key advantage of ODM in this context is schema flexibility: different document types (articles, posts, repositories) can have different fields without requiring schema migrations, while Pydantic still enforces type safety at the application level.
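
As a rough illustration of the mapping in the table, the sketch below round-trips an object through a MongoDB-style dictionary. It uses a stdlib dataclass as a stand-in for the Pydantic model, and a module-level `to_mongo` / `from_mongo` pair named after the methods above. The field names, the example URL, and the choice to store the UUID as a string under `_id` are assumptions made for illustration, not the handbook's code.

```python
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Article:
    """Stand-in for a Pydantic document model (fields are illustrative)."""
    title: str
    link: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)


def to_mongo(doc: Article) -> dict:
    """Object -> document: rename id to _id and store the UUID as a string."""
    data = asdict(doc)
    data["_id"] = str(data.pop("id"))
    return data


def from_mongo(data: dict) -> Article:
    """Document -> object: restore the id field and its UUID type."""
    fields = dict(data)  # copy so the caller's dict is not mutated
    fields["id"] = uuid.UUID(fields.pop("_id"))
    return Article(**fields)


article = Article(title="ODM basics", link="https://example.com/odm")
document = to_mongo(article)    # plain dict, ready for a document store
restored = from_mongo(document) # equal to the original object
```

The document is an ordinary dictionary, so a second document type with different fields could sit beside it without any schema change on the storage side.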

The Active Record Pattern

The persistence layer follows the Active Record pattern, in which domain objects encapsulate both data and database operations:

NoSQLBaseDocument
  |
  +-- Data fields (via Pydantic BaseModel)
  |     - id: UUID4
  |     - created_at: datetime
  |     - ...domain-specific fields
  |
  +-- Persistence methods
        - save() -> self
        - find(**filters) -> instance
        - get_or_create(**filters) -> instance
        - bulk_insert(documents) -> bool
        - _get_collection() -> pymongo.Collection
        - to_mongo() -> dict
        - from_mongo(data) -> instance

Each document type inherits this full capability:

  • UserDocument persists to the users collection
  • ArticleDocument persists to the articles collection
  • PostDocument persists to the posts collection
  • RepositoryDocument persists to the repositories collection

Pydantic-Based Validation

By building the ODM on top of Pydantic's BaseModel, every document undergoes automatic validation before persistence:

  • Type Checking: Fields must match their declared types (str, int, UUID4, datetime, etc.)
  • Required Fields: Missing required fields raise validation errors at instantiation time
  • Default Values: Fields with defaults (like id and created_at) are auto-populated
  • Serialization: to_mongo() converts Pydantic models to MongoDB-compatible dictionaries, handling UUID and datetime serialization

This enforces data integrity at the application boundary -- a payload that fails validation raises an error at instantiation, so the invalid data never becomes a document that could be saved.
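
To make the validate-at-construction idea concrete without depending on Pydantic, the sketch below hand-rolls the type check in a dataclass. This is only an imitation: Pydantic performs these checks automatically and raises its own `ValidationError`, whereas this stand-in raises `TypeError`. The `PostDocument` fields and the `try_build` helper are assumptions for illustration.

```python
import uuid
from dataclasses import dataclass, field, fields
from datetime import datetime, timezone


@dataclass
class PostDocument:
    """Stand-in model; __post_init__ imitates automatic type validation."""
    content: str
    platform: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self):
        # Type check: every field must be an instance of its declared type.
        for spec in fields(self):
            value = getattr(self, spec.name)
            if not isinstance(value, spec.type):
                raise TypeError(
                    f"{spec.name} must be {spec.type.__name__}, "
                    f"got {type(value).__name__}"
                )


def try_build(**kwargs):
    """Hypothetical helper: return a document, or None if construction fails."""
    try:
        return PostDocument(**kwargs)
    except TypeError:
        return None


valid = try_build(content="Hello", platform="linkedin")
wrong_type = try_build(content=123, platform="linkedin")  # content is not a str
missing_field = try_build(content="Hello")                # platform is required
```

Both failure cases stop before any object exists, so there is nothing to pass to a persistence call.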

Collection Routing

Each document subclass declares its target MongoDB collection through a Settings inner class:

class ArticleDocument(NoSQLBaseDocument):
    class Settings:
        name = "articles"  # MongoDB collection name

The base class uses this setting in _get_collection() to route persistence operations to the correct collection. This is a form of convention over configuration -- the mapping between domain types and database collections is declared once and enforced automatically.
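
A minimal sketch of that routing, assuming an in-memory `FAKE_DB` dictionary in place of a pymongo database. The `_get_collection` name and the `Settings.name` convention come from the text; the lookup logic, the `NotImplementedError` for a missing setting, and the simplified `save` are assumptions.

```python
import uuid

# In-memory stand-in for a database: {collection name: {_id: document}}.
FAKE_DB: dict = {}


class BaseDocument:
    def __init__(self, **fields):
        self.id = str(uuid.uuid4())
        self.fields = fields

    @classmethod
    def _get_collection(cls) -> dict:
        # Read the collection name from the subclass's Settings inner class.
        settings = getattr(cls, "Settings", None)
        name = getattr(settings, "name", None)
        if not name:
            raise NotImplementedError(f"{cls.__name__} must define Settings.name")
        return FAKE_DB.setdefault(name, {})

    def save(self):
        # The caller never names a collection; the class decides.
        self._get_collection()[self.id] = {"_id": self.id, **self.fields}
        return self


class ArticleDocument(BaseDocument):
    class Settings:
        name = "articles"


class RepositoryDocument(BaseDocument):
    class Settings:
        name = "repositories"


ArticleDocument(title="ODM basics").save()
RepositoryDocument(name="example-repo").save()

try:
    BaseDocument._get_collection()  # the base class declares no Settings
    base_is_routed = True
except NotImplementedError:
    base_is_routed = False
```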

Usage

Document Persistence is applied when persisting structured domain objects to MongoDB with consistent serialization, validation, and CRUD operations. The typical patterns are:

Single Document Persistence

  1. Create a domain document instance (e.g., from crawled content)
  2. Call .save() on the instance
  3. The base class serializes the document via to_mongo() and inserts it into the appropriate collection
  4. The saved instance (or None on failure) is returned

Bulk Document Persistence

  1. Collect a list of domain document instances
  2. Call cls.bulk_insert(documents) on the document class
  3. All documents are serialized and inserted in a single insert_many operation
  4. Returns True on success, False on failure
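
The bulk steps above can be sketched as follows. `FakeCollection` and `DuplicateKey` are hypothetical stand-ins for a pymongo collection and its write errors, so the example runs without a database; only the `bulk_insert` name, the single `insert_many` call, and the boolean return value come from the text.

```python
import uuid


class DuplicateKey(Exception):
    """Stand-in for a database write error."""


class FakeCollection:
    """In-memory stand-in for a collection with an insert_many method."""

    def __init__(self):
        self.docs = {}

    def insert_many(self, documents):
        # Inserts in order and stops at the first duplicate _id.
        for doc in documents:
            if doc["_id"] in self.docs:
                raise DuplicateKey(doc["_id"])
            self.docs[doc["_id"]] = doc


ARTICLES = FakeCollection()


class ArticleDocument:
    def __init__(self, title):
        self.id = str(uuid.uuid4())
        self.title = title

    def to_mongo(self) -> dict:
        return {"_id": self.id, "title": self.title}

    @classmethod
    def bulk_insert(cls, documents) -> bool:
        try:
            # Serialize everything first, then issue one insert_many call.
            ARTICLES.insert_many([doc.to_mongo() for doc in documents])
            return True
        except DuplicateKey:
            return False


batch = [ArticleDocument("first"), ArticleDocument("second")]
ok = ArticleDocument.bulk_insert(batch)          # both documents are new
repeated = ArticleDocument.bulk_insert(batch[:1])  # same _id again
```

The boolean result reports that something failed but not which document caused it, so a caller that needs that detail would have to look at the logs.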

Get-or-Create Persistence

  1. Provide filter criteria (e.g., first name and last name)
  2. Call cls.get_or_create(**filters)
  3. If a matching document exists, it is returned; otherwise, a new one is created, saved, and returned
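
A minimal sketch of the get-or-create flow, using a plain list as a stand-in for the users collection. The `USERS` list, the constructor signature, and the matching logic are assumptions; only the `get_or_create(**filters)` call shape and its return-existing-or-create behaviour come from the steps above.

```python
import uuid

# In-memory stand-in for the users collection.
USERS: list = []


class UserDocument:
    def __init__(self, first_name, last_name, id=None):
        self.id = id or str(uuid.uuid4())
        self.first_name = first_name
        self.last_name = last_name

    def save(self):
        USERS.append(
            {"id": self.id, "first_name": self.first_name, "last_name": self.last_name}
        )
        return self

    @classmethod
    def get_or_create(cls, **filters):
        # Step 1: look for an existing document matching every filter.
        for doc in USERS:
            if all(doc.get(key) == value for key, value in filters.items()):
                return cls(**doc)
        # Step 2: none found, so build one from the filters and save it.
        return cls(**filters).save()


first = UserDocument.get_or_create(first_name="Ada", last_name="Lovelace")
second = UserDocument.get_or_create(first_name="Ada", last_name="Lovelace")
```

In this sketch the lookup and the insert are two separate steps, so two concurrent callers could in principle both reach the create branch; whether that matters depends on how the pipeline is run.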

Design Considerations

  • Error Handling: All persistence methods catch WriteError exceptions from pymongo, log the error via loguru, and return a failure indicator (None or False) rather than raising. A single failed write therefore does not abort the pipeline, but callers must check the return value to notice the failure.
  • No Upsert Semantics: The save() method performs insert_one, not update_one with upsert. Saving a document whose _id already exists is rejected with a duplicate key error (pymongo's DuplicateKeyError, a subclass of WriteError), so save() returns None rather than overwriting the stored document. This is intentional -- updates require explicit logic.
  • UUID-Based Identity: Documents use UUID4 identifiers generated at instantiation time, not MongoDB's ObjectId. This allows identity to be established before persistence and decouples the domain model from MongoDB internals.
  • No Lazy Loading: The ODM does not implement lazy loading of related documents. Cross-document references (e.g., user references in articles) are resolved at the application level.
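
The first three considerations can be seen together in a small sketch. It assumes an in-memory `STORE` and a hypothetical `DuplicateKey` exception in place of pymongo, and uses the stdlib `logging` module where the text names loguru.

```python
import logging
import uuid

logger = logging.getLogger("odm_sketch")


class DuplicateKey(Exception):
    """Stand-in for a duplicate key write error."""


# In-memory stand-in for one collection: {_id: document}.
STORE: dict = {}


def insert_one(doc: dict) -> None:
    # Insert only: an existing _id is rejected, never overwritten.
    if doc["_id"] in STORE:
        raise DuplicateKey(doc["_id"])
    STORE[doc["_id"]] = doc


class RepositoryDocument:
    def __init__(self, name):
        self.id = uuid.uuid4()  # identity exists before any database call
        self.name = name

    def to_mongo(self) -> dict:
        return {"_id": str(self.id), "name": self.name}

    def save(self):
        try:
            insert_one(self.to_mongo())
            return self
        except DuplicateKey:
            # Log and signal failure instead of raising.
            logger.warning("Failed to insert document %s", self.id)
            return None


repo = RepositoryDocument("example-repo")
known_id = repo.id       # usable for cross-references before saving
first = repo.save()      # inserted
repo.name = "renamed"
second = repo.save()     # same _id: rejected, stored copy is unchanged
```

The second call shows why updates need their own code path: calling `save()` again leaves the original document in place and only reports failure through its return value.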

Related Concepts

  • Object-Relational Mapping (ORM) -- the SQL counterpart to ODM (e.g., SQLAlchemy, Django ORM)
  • Active Record Pattern (Fowler) -- domain objects that encapsulate data and persistence
  • Repository Pattern -- an alternative where persistence logic is separated from domain objects
  • Data Mapper Pattern -- another alternative with explicit mapping layers between domain and storage
  • BSON Serialization -- MongoDB's binary JSON format for document storage
