Implementation:PacktPublishing LLM Engineers Handbook VectorBaseDocument Bulk Find
| Aspect | Detail |
|---|---|
| API | VectorBaseDocument.bulk_find(limit: int = 10, **kwargs) -> tuple[list, str \| None] |
| Source | llm_engineering/domain/base/vector.py:L106-135 |
| Type | API Doc |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Feature_Store_Query |
Summary
The bulk_find class method on VectorBaseDocument provides paginated retrieval of cleaned documents from a Qdrant vector store collection. It wraps the Qdrant client's scroll API to return a batch of documents along with a pagination cursor (next_offset) for iterating through large result sets. This is the primary entry point for querying the feature store during dataset generation.
Source Code
@classmethod
def bulk_find(cls, limit: int = 10, **kwargs) -> tuple[list, str | None]:
    collection_name = cls.get_collection_name()
    qdrant_client = connection.get_qdrant_client()

    # Uses scroll API for paginated retrieval
    records, next_offset = qdrant_client.scroll(
        collection_name=collection_name,
        limit=limit,
        with_payload=True,
        with_vectors=True,
        scroll_filter=...,  # Optional filtering
    )

    return [cls.from_record(record) for record in records], next_offset
Import
from llm_engineering.domain.base.vector import VectorBaseDocument
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | int | 10 | Maximum number of documents to return per batch (page size) |
| **kwargs | varies | -- | Optional filter arguments passed to the Qdrant scroll filter (e.g., category filtering) |
Return Value
| Component | Type | Description |
|---|---|---|
| Documents | list[T] | List of deserialized VectorBaseDocument subclass instances (e.g., CleanedArticle, CleanedPost, CleanedRepositoryDocument) |
| Next Offset | str \| None | Pagination cursor for the next batch. None indicates no more results remain. |
Behavior
- Resolves the Qdrant collection name from the class using cls.get_collection_name(). Each document subclass maps to its own collection.
- Obtains the Qdrant client instance via the connection module's singleton accessor.
- Calls qdrant_client.scroll() with:
  - collection_name -- the resolved collection
  - limit -- the requested batch size
  - with_payload=True -- includes the stored document data
  - with_vectors=True -- includes the embedding vectors
  - Optional scroll filter constructed from **kwargs
- Deserializes each Qdrant record into the appropriate domain object using cls.from_record(record).
- Returns the list of domain objects and the next_offset cursor.
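The cursor contract behind these steps can be illustrated without a running Qdrant instance. The following is a minimal sketch, not the project's code: Record and InMemoryScrollStore are hypothetical stand-ins, and the integer IDs are an assumption made for readability. It mimics the scroll behavior of returning a page of records together with the ID at which the next page starts, or None when nothing remains.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Stand-in for a stored point: an ID plus its payload."""
    id: int
    payload: dict


class InMemoryScrollStore:
    """Hypothetical stand-in for a vector store client; not the Qdrant client."""

    def __init__(self, records: list) -> None:
        # Keep records ordered by ID so pages are stable between calls.
        self._records = sorted(records, key=lambda r: r.id)

    def scroll(self, limit: int, offset: Optional[int] = None) -> tuple:
        # Start at the first record whose ID is >= offset (or at the beginning).
        candidates = [r for r in self._records if offset is None or r.id >= offset]
        page = candidates[:limit]
        # The cursor is the ID of the first record on the next page, if any.
        next_offset = candidates[limit].id if len(candidates) > limit else None
        return page, next_offset


store = InMemoryScrollStore([Record(id=i, payload={"content": f"doc {i}"}) for i in range(5)])
first_page, cursor = store.scroll(limit=2)
second_page, cursor2 = store.scroll(limit=2, offset=cursor)
```

Passing the returned cursor back as the offset of the next call is what lets a caller walk the whole collection one batch at a time.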
Usage Example
from llm_engineering.domain.cleaned_documents import CleanedArticle
# Retrieve first batch of 50 cleaned articles
articles, next_offset = CleanedArticle.bulk_find(limit=50)
# Continue paginating
while next_offset is not None:
    more_articles, next_offset = CleanedArticle.bulk_find(limit=50, offset=next_offset)
    articles.extend(more_articles)
print(f"Retrieved {len(articles)} cleaned articles from feature store")
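When the full result set does not need to be held in one list, the pagination loop above can be wrapped in a generator that yields one batch at a time. This is a sketch under stated assumptions: iter_pages is a hypothetical helper, not part of the library, and StubDocument is a stand-in whose bulk_find uses integer offsets purely for illustration (the real cursor should be treated as opaque).

```python
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[..., tuple], limit: int = 50) -> Iterator[list]:
    """Yield successive batches until the cursor comes back as None."""
    offset = None
    while True:
        kwargs = {"limit": limit}
        if offset is not None:
            # Only pass the cursor once a previous call has returned one.
            kwargs["offset"] = offset
        batch, offset = fetch_page(**kwargs)
        if batch:
            yield batch
        if offset is None:
            break


class StubDocument:
    """Hypothetical stand-in for a document class such as CleanedArticle."""

    _items = [f"article-{i}" for i in range(7)]

    @classmethod
    def bulk_find(cls, limit: int = 10, **kwargs):
        start = kwargs.get("offset") or 0
        end = start + limit
        next_offset = end if end < len(cls._items) else None
        return cls._items[start:end], next_offset


batches = list(iter_pages(StubDocument.bulk_find, limit=3))
total = sum(len(batch) for batch in batches)
```

Because the helper takes the fetch function as an argument, the same loop could be reused with any class exposing a bulk_find with this return shape.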
External Dependencies
| Dependency | Purpose |
|---|---|
| qdrant_client | Qdrant vector database client for scroll-based retrieval |
| pydantic | Base model for VectorBaseDocument serialization/deserialization |
Design Notes
- The method is a classmethod so that each subclass (e.g., CleanedArticle, CleanedPost) automatically resolves to its own Qdrant collection without additional configuration.
- Using with_payload=True and with_vectors=True ensures full document reconstruction, which is necessary when the downstream pipeline needs both the text content and the embedding vectors.
- The scroll API is preferred over search for exhaustive retrieval because it does not require a query vector and guarantees iteration over the entire collection.
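The first design note, that a classmethod lets each subclass resolve its own collection, can be sketched in isolation. The class names, the collection_name attribute, the describe_query helper, and the collection names below are all assumptions made for illustration; the real VectorBaseDocument may store the name differently.

```python
class BaseDocument:
    # Hypothetical attribute; the real class may store the name differently.
    collection_name: str = ""

    @classmethod
    def get_collection_name(cls) -> str:
        if not cls.collection_name:
            raise NotImplementedError(f"{cls.__name__} must define collection_name")
        return cls.collection_name

    @classmethod
    def describe_query(cls, limit: int = 10) -> dict:
        # cls is whichever subclass the method was called on, so the
        # collection is picked up without any per-call configuration.
        return {"collection_name": cls.get_collection_name(), "limit": limit}


class ArticleDocument(BaseDocument):
    collection_name = "cleaned_articles"  # illustrative name


class PostDocument(BaseDocument):
    collection_name = "cleaned_posts"  # illustrative name


article_query = ArticleDocument.describe_query(limit=50)
post_query = PostDocument.describe_query()
```

The shared method body never names a collection; the subclass it is invoked on supplies it.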
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Feature_Store_Query -- The principle this implementation realizes
- Implementation:PacktPublishing_LLM_Engineers_Handbook_DatasetGenerator_Get_Prompts -- The next step that consumes retrieved documents
- Environment:PacktPublishing_LLM_Engineers_Handbook_Docker_MongoDB_Qdrant_Infrastructure