Principle:PacktPublishing LLM Engineers Handbook Vector Storage
| Concept | Vector database persistence for document embeddings |
|---|---|
| Workflow | Feature_Engineering |
| Pipeline Stage | Data Persistence |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implemented By | Implementation:PacktPublishing_LLM_Engineers_Handbook_VectorBaseDocument_Bulk_Insert |
Overview
Vector Storage is the pattern of persisting documents with optional vector embeddings into a specialized database that supports both traditional filtering and approximate nearest-neighbor (ANN) search. In the LLM Engineers Handbook, Qdrant serves as the vector database, storing cleaned documents, chunked segments, and embedded representations used by the RAG retrieval system.
Theory
Vector databases are purpose-built storage systems optimized for high-dimensional vector similarity search. Unlike traditional databases that index scalar fields, vector databases build specialized index structures (e.g., HNSW — Hierarchical Navigable Small World graphs) that enable efficient approximate nearest-neighbor queries in high-dimensional spaces.
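To make concrete what an index like HNSW is approximating, here is a brute-force exact nearest-neighbor search by cosine similarity. This is an illustrative sketch, not code from the handbook: it scans every stored vector in O(n·d) per query, which is exactly the cost HNSW avoids by navigating a graph of neighbors instead.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def exact_nearest(query: list[float], vectors: dict[str, list[float]]) -> str:
    """Exact nearest-neighbor search: compares the query against every
    stored vector. HNSW trades this exactness for sub-linear query time."""
    return max(vectors, key=lambda doc_id: cosine_similarity(query, vectors[doc_id]))

docs = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.7, 0.7, 0.0],
}
print(exact_nearest([0.9, 0.8, 0.0], docs))  # chunk-c: closest in angle
```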
Collection-Level Partitioning
The Vector Storage pattern uses collection-level partitioning where each document type maps to its own Qdrant collection. This provides:
- Schema isolation — Each collection can have its own vector dimensionality and distance metric
- Query efficiency — Searches are scoped to a single document type, reducing the search space
- Independent scaling — Collections can be sharded and replicated independently based on their size and access patterns
Documents With and Without Vectors
Not all documents stored in the vector database require vector embeddings. The pattern supports two modes:
- With vector index — Documents that have an `embedding` field are stored with vector indexes, enabling similarity search. These are typically embedded chunks used for RAG retrieval.
- Without vector index — Documents that lack embeddings are stored as structured payloads only. These are typically cleaned documents or intermediate representations that need to be persisted but are not yet searchable by similarity.
This dual-mode storage is determined automatically by inspecting whether the document class defines a vector field.
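The automatic detection can be sketched by inspecting the class definition for an embedding field. The field name `embedding` and the use of dataclasses are assumptions for illustration; the handbook's own base class performs an equivalent check on its document models.

```python
from dataclasses import dataclass, fields

@dataclass
class CleanedDocument:      # payload-only: no embedding field
    id: str
    content: str

@dataclass
class EmbeddedChunk:        # vector-indexed: carries an embedding
    id: str
    content: str
    embedding: list[float]

def has_vector_field(doc_cls: type) -> bool:
    """Decide storage mode by inspecting the class, mirroring the
    automatic dual-mode detection described above."""
    return any(f.name == "embedding" for f in fields(doc_cls))

print(has_vector_field(CleanedDocument), has_vector_field(EmbeddedChunk))
```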
Bulk Upsert Pattern
The Vector Storage pattern uses upsert (update-or-insert) semantics for write operations. This ensures:
- Idempotency — Running the pipeline multiple times with the same data produces the same result without duplicating records
- Incremental updates — New or modified documents are inserted or updated without requiring a full collection rebuild
- Consistency — Each document's UUID serves as a stable identifier across pipeline runs
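A minimal in-memory model of these semantics, assuming the document's UUID is used as the point ID: because writes are keyed by a stable identifier, re-running the same batch overwrites rather than duplicates.

```python
import uuid

def upsert(store: dict[uuid.UUID, dict], documents: list[dict]) -> None:
    """Update-or-insert: each document's UUID is its storage key,
    so repeated writes of the same document are idempotent."""
    for doc in documents:
        store[doc["id"]] = doc  # insert or overwrite, keyed by stable UUID

store: dict[uuid.UUID, dict] = {}
doc_id = uuid.uuid4()
batch = [{"id": doc_id, "content": "chunk text", "embedding": [0.1, 0.2]}]

upsert(store, batch)
upsert(store, batch)  # second run does not duplicate the record
print(len(store))  # 1
```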
How It Fits in Feature Engineering
Vector Storage is the final persistence step in the feature engineering pipeline. Documents flow through the pipeline as follows:
- Query — Raw documents are loaded from MongoDB
- Clean — Documents are normalized and sanitized
- Chunk — Cleaned documents are split into segments
- Embed — Chunks are converted into vector representations
- Store (this pattern) — Embedded chunks are persisted to Qdrant
At multiple stages in this pipeline, intermediate results may be persisted to the vector database. For example, cleaned documents may be stored before chunking, and chunks may be stored before embedding. This allows the pipeline to be interrupted and resumed without losing progress.
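The staged flow with intermediate persistence can be sketched as below. Every function name is a hypothetical stand-in for the handbook's pipeline steps, and the dict stands in for the vector database; the point is that each stage's output is persisted before the next stage runs, so an interrupted run can resume.

```python
def clean(raw: str) -> str:
    # Stand-in cleaning step: normalize whitespace.
    return " ".join(raw.split())

def chunk(text: str, size: int = 20) -> list[str]:
    # Stand-in chunking step: fixed-size character windows.
    return [text[i : i + size] for i in range(0, len(text), size)]

def embed(chunks: list[str]) -> list[list[float]]:
    # Stand-in embedding step: real pipelines call an embedding model here.
    return [[float(len(c))] for c in chunks]

persisted: dict[str, object] = {}  # stand-in for the vector database

raw = "Vector   databases power  RAG retrieval."
cleaned = clean(raw)
persisted["cleaned"] = cleaned          # persisted before chunking
chunks = chunk(cleaned)
persisted["chunks"] = chunks            # persisted before embedding
persisted["embedded"] = embed(chunks)   # final store step
print(len(chunks))
```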
Design Considerations
- Automatic collection creation — If the target collection does not exist, it is created on the fly with the appropriate configuration (vector size, distance metric). This eliminates the need for manual schema management.
- Document grouping — When inserting a heterogeneous list of documents, they are grouped by class before insertion, ensuring each group is routed to the correct collection.
- Payload serialization — Document fields are serialized to Qdrant-compatible payloads, with special handling for types like UUID, datetime, and nested Pydantic models.
- Error resilience — Failed insertions are logged but do not crash the pipeline, allowing partial progress to be preserved.
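The payload serialization consideration can be sketched as a recursive converter that turns document fields into JSON-safe values. The exact conversions (UUID to string, datetime to ISO 8601) are assumptions for illustration, not the handbook's code.

```python
import json
import uuid
from datetime import datetime, timezone

def to_payload(value):
    """Serialize document field values into JSON-safe payload types,
    recursing into dicts and lists."""
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, dict):
        return {k: to_payload(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_payload(v) for v in value]
    return value

doc = {
    "id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "metadata": {"source": "article"},
}
payload = to_payload(doc)
print(json.dumps(payload))  # now JSON-serializable
```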
Usage
Use the Vector Storage pattern when:
- Persisting cleaned documents or embedded chunks to a vector database for later retrieval via similarity search or direct lookup
- Building a feature store that needs to support both structured queries and semantic similarity search
- Implementing an idempotent write layer that can be safely re-run without duplicating data
- Storing intermediate pipeline results for debugging, auditing, or incremental processing
Example
```python
from llm_engineering.domain.base.vector import VectorBaseDocument

# Assume we have a list of embedded chunk documents
embedded_chunks = [chunk1, chunk2, chunk3]

# Bulk insert into Qdrant — automatically groups by class,
# creates collections if needed, and upserts
success = VectorBaseDocument.bulk_insert(embedded_chunks)
print(f"Insert successful: {success}")
```