
Principle:PacktPublishing LLM Engineers Handbook Vector Storage

From Leeroopedia


Concept Vector database persistence for document embeddings
Workflow Feature_Engineering
Pipeline Stage Data Persistence
Repository PacktPublishing/LLM-Engineers-Handbook
Implemented By Implementation:PacktPublishing_LLM_Engineers_Handbook_VectorBaseDocument_Bulk_Insert

Overview

Vector Storage is the pattern of persisting documents with optional vector embeddings into a specialized database that supports both traditional filtering and approximate nearest-neighbor (ANN) search. In the LLM Engineers Handbook, Qdrant serves as the vector database, storing cleaned documents, chunked segments, and embedded representations used by the RAG retrieval system.

Theory

Vector databases are purpose-built storage systems optimized for high-dimensional vector similarity search. Unlike traditional databases that index scalar fields, vector databases build specialized index structures (e.g., HNSW — Hierarchical Navigable Small World graphs) that enable efficient approximate nearest-neighbor queries in high-dimensional spaces.
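To see what an ANN index saves, compare it with the exact alternative. The sketch below is a minimal brute-force nearest-neighbor search over cosine similarity; it is illustrative only and not how Qdrant or HNSW is implemented — a real index avoids scanning every vector.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_nearest(query, vectors, k=2):
    # Exact search scans every stored vector: O(N * d) per query.
    # HNSW-style indexes instead greedily walk a small-world graph,
    # visiting only a small fraction of the N vectors.
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(exact_nearest([1.0, 0.0], vectors))  # [0, 2]
```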

Collection-Level Partitioning

The Vector Storage pattern uses collection-level partitioning where each document type maps to its own Qdrant collection. This provides:

  • Schema isolation — Each collection can have its own vector dimensionality and distance metric
  • Query efficiency — Searches are scoped to a single document type, reducing the search space
  • Independent scaling — Collections can be sharded and replicated independently based on their size and access patterns
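One simple way to realize per-type collections is to derive the collection name from the document class itself. The helper below is a hypothetical sketch of such a convention (the function name and snake-casing rule are assumptions, not the handbook's exact code):

```python
import re

def collection_name_for(cls) -> str:
    # Hypothetical convention: CamelCase class name -> snake_case
    # collection name, giving each document type its own collection.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", cls.__name__).lower()

class EmbeddedChunk: pass
class CleanedArticle: pass

print(collection_name_for(EmbeddedChunk))   # embedded_chunk
print(collection_name_for(CleanedArticle))  # cleaned_article
```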

Documents With and Without Vectors

Not all documents stored in the vector database require vector embeddings. The pattern supports two modes:

  • With vector index — Documents that have an embedding field are stored with vector indexes, enabling similarity search. These are typically embedded chunks used for RAG retrieval.
  • Without vector index — Documents that lack embeddings are stored as structured payloads only. These are typically cleaned documents or intermediate representations that need to be persisted but are not yet searchable by similarity.

This dual-mode storage is determined automatically by inspecting whether the document class defines a vector field.
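The inspection step can be sketched as follows. This is a simplified stand-in — the handbook's models are Pydantic classes, and the field name `embedding` is an assumption here — but it shows how class annotations can drive the with-vector / without-vector decision:

```python
from typing import get_type_hints

class VectorBaseDoc:
    """Illustrative base class; the real project uses Pydantic models."""

class CleanedDocument(VectorBaseDoc):
    content: str

class EmbeddedChunk(VectorBaseDoc):
    content: str
    embedding: list  # presence of this field enables a vector index

def has_vector_index(cls) -> bool:
    # Dual-mode storage: create a vector index only when the class
    # declares an `embedding` annotation (field name is an assumption).
    return "embedding" in get_type_hints(cls)

print(has_vector_index(EmbeddedChunk))    # True
print(has_vector_index(CleanedDocument))  # False
```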

Bulk Upsert Pattern

The Vector Storage pattern uses upsert (update-or-insert) semantics for write operations. This ensures:

  • Idempotency — Running the pipeline multiple times with the same data produces the same result without duplicating records
  • Incremental updates — New or modified documents are inserted or updated without requiring a full collection rebuild
  • Consistency — Each document's UUID serves as a stable identifier across pipeline runs
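The idempotency guarantee above follows directly from keying writes on a stable identifier. The toy collection below is not Qdrant — it is a minimal in-memory sketch showing why upserting under a deterministic UUID makes re-runs safe:

```python
import uuid

class InMemoryCollection:
    """Toy stand-in for a vector-database collection keyed by point id."""
    def __init__(self):
        self.points = {}

    def upsert(self, point_id, payload, vector=None):
        # Update-or-insert: writing the same id twice overwrites the
        # existing point instead of duplicating it.
        self.points[point_id] = {"payload": payload, "vector": vector}

collection = InMemoryCollection()
# A UUID derived deterministically from the document's identity stays
# stable across pipeline runs (uuid5 is one way to achieve this).
doc_id = uuid.uuid5(uuid.NAMESPACE_URL, "articles/how-to-rag")

for _ in range(3):  # simulate re-running the pipeline
    collection.upsert(doc_id, {"content": "cleaned text"})

print(len(collection.points))  # 1
```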

How It Fits in Feature Engineering

Vector Storage is the final persistence step in the feature engineering pipeline. Documents flow through the pipeline as follows:

  1. Query — Raw documents are loaded from MongoDB
  2. Clean — Documents are normalized and sanitized
  3. Chunk — Cleaned documents are split into segments
  4. Embed — Chunks are converted into vector representations
  5. Store (this pattern) — Embedded chunks are persisted to Qdrant

At multiple stages in this pipeline, intermediate results may be persisted to the vector database. For example, cleaned documents may be stored before chunking, and chunks may be stored before embedding. This allows the pipeline to be interrupted and resumed without losing progress.
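The staged-persistence idea can be sketched end to end. All transforms below are trivial placeholders (the real clean/chunk/embed steps are far richer), and the list-backed `store` stands in for the vector database; the point is only that each stage persists its output before the next begins:

```python
def clean(doc):  # placeholder transforms, not the handbook's logic
    return doc.strip().lower()

def chunk(doc, size=16):
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def embed(segment):
    return {"text": segment, "embedding": [float(len(segment))]}

def run_pipeline(raw_docs, store):
    cleaned = [clean(d) for d in raw_docs]
    store.append(("cleaned", cleaned))    # resume point 1
    chunks = [c for d in cleaned for c in chunk(d)]
    store.append(("chunked", chunks))     # resume point 2
    embedded = [embed(c) for c in chunks]
    store.append(("embedded", embedded))  # final persist
    return embedded

store = []  # stand-in for the vector database
run_pipeline(["  Some RAW Document text  "], store)
print([stage for stage, _ in store])  # ['cleaned', 'chunked', 'embedded']
```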

Design Considerations

  • Automatic collection creation — If the target collection does not exist, it is created on the fly with the appropriate configuration (vector size, distance metric). This eliminates the need for manual schema management.
  • Document grouping — When inserting a heterogeneous list of documents, they are grouped by class before insertion, ensuring each group is routed to the correct collection.
  • Payload serialization — Document fields are serialized to Qdrant-compatible payloads, with special handling for types like UUID, datetime, and nested Pydantic models.
  • Error resilience — Failed insertions are logged but do not crash the pipeline, allowing partial progress to be preserved.
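The payload-serialization consideration can be illustrated with a small recursive converter. This is a hedged sketch, not the handbook's serializer: it assumes UUIDs become strings and datetimes become ISO-8601 text, and real Pydantic models would be expanded to dicts first:

```python
import uuid
from datetime import datetime, timezone

def to_payload(value):
    # Vector-database payloads must be JSON-compatible, so rich Python
    # types are converted recursively.
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, dict):
        return {k: to_payload(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_payload(v) for v in value]
    return value

doc = {
    "id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "tags": ["rag", "qdrant"],
}
print(to_payload(doc)["id"])  # 12345678-1234-5678-1234-567812345678
```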

Usage

Use the Vector Storage pattern when:

  • Persisting cleaned documents or embedded chunks to a vector database for later retrieval via similarity search or direct lookup
  • Building a feature store that needs to support both structured queries and semantic similarity search
  • Implementing an idempotent write layer that can be safely re-run without duplicating data
  • Storing intermediate pipeline results for debugging, auditing, or incremental processing

Example

from llm_engineering.domain.base.vector import VectorBaseDocument

# Assume we have a list of embedded chunk documents
embedded_chunks = [chunk1, chunk2, chunk3]

# Bulk insert into Qdrant — automatically groups by class,
# creates collections if needed, and upserts
success = VectorBaseDocument.bulk_insert(embedded_chunks)
print(f"Insert successful: {success}")
