Implementation:PacktPublishing LLM Engineers Handbook VectorBaseDocument Bulk Find

Aspect	Detail
API	VectorBaseDocument.bulk_find(cls, limit: int = 10, **kwargs) -> tuple[list, str | None]
Source	llm_engineering/domain/base/vector.py:L106-135
Type	API Doc
Implements	Principle:PacktPublishing_LLM_Engineers_Handbook_Feature_Store_Query

Summary

The bulk_find class method on VectorBaseDocument provides paginated retrieval of cleaned documents from a Qdrant vector store collection. It wraps the Qdrant client's scroll API to return a batch of documents along with a pagination cursor (next_offset) for iterating through large result sets. This is the primary entry point for querying the feature store during dataset generation.

Source Code

@classmethod
def bulk_find(cls, limit: int = 10, **kwargs) -> tuple[list, str | None]:
    collection_name = cls.get_collection_name()
    qdrant_client = connection.get_qdrant_client()

    # Pagination cursor from a previous call; None starts at the beginning
    offset = kwargs.pop("offset", None)

    # Uses scroll API for paginated retrieval
    records, next_offset = qdrant_client.scroll(
        collection_name=collection_name,
        limit=limit,
        offset=offset,
        with_payload=True,
        with_vectors=True,
        scroll_filter=...  # Optional filtering
    )

    # Convert each raw Qdrant record into a domain object
    return [cls.from_record(record) for record in records], next_offset

Import

from llm_engineering.domain.base.vector import VectorBaseDocument

Parameters

Parameter	Type	Default	Description
`limit`	`int`	`10`	Maximum number of documents to return per batch (page size)
`**kwargs`	varies	--	Optional filter arguments passed to the Qdrant scroll filter (e.g., category filtering)

Return Value

Component	Type	Description
Documents	`list[T]`	List of deserialized `VectorBaseDocument` subclass instances (e.g., `CleanedArticle`, `CleanedPost`, `CleanedRepositoryDocument`)
Next Offset	`str | None`	Pagination cursor for the next batch. `None` indicates no more results remain.

Behavior

1. Resolves the Qdrant collection name from the class using cls.get_collection_name(). Each document subclass maps to its own collection.
2. Obtains the Qdrant client instance via the connection module's singleton accessor.
3. Calls qdrant_client.scroll() with:
   - collection_name -- the resolved collection
   - limit -- the requested batch size
   - with_payload=True -- includes the stored document data
   - with_vectors=True -- includes the embedding vectors
   - Optional scroll filter constructed from **kwargs
4. Deserializes each Qdrant record into the appropriate domain object using cls.from_record(record).
5. Returns the list of domain objects and the next_offset cursor.
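
The steps above can be sketched without a running Qdrant instance. The block below is a minimal illustration, not the handbook's implementation: FakeRecord, FakeScrollClient, SketchDocument and the collection name "sketch_documents" are all hypothetical stand-ins, and the sketch assumes the pagination cursor arrives as an offset keyword argument, as the usage example on this page suggests.

```python
from dataclasses import dataclass


@dataclass
class FakeRecord:
    """Hypothetical stand-in for a stored point: id, payload and vector."""
    id: int
    payload: dict
    vector: list


class FakeScrollClient:
    """Hypothetical in-memory client that imitates a scroll-style call."""

    def __init__(self, records):
        # Keep records ordered by id so the cursor has a stable meaning.
        self._records = sorted(records, key=lambda r: r.id)

    def scroll(self, collection_name, limit, offset=None):
        # Resume at the first record whose id is >= offset, or at the start.
        remaining = [r for r in self._records if offset is None or r.id >= offset]
        page = remaining[:limit]
        # The cursor is the id that opens the next page, or None at the end.
        next_offset = remaining[limit].id if len(remaining) > limit else None
        return page, next_offset


class SketchDocument:
    """Simplified stand-in for a VectorBaseDocument subclass."""

    client = None  # assigned a FakeScrollClient below

    def __init__(self, id, content, embedding):
        self.id = id
        self.content = content
        self.embedding = embedding

    @classmethod
    def get_collection_name(cls):
        return "sketch_documents"  # hypothetical collection name

    @classmethod
    def from_record(cls, record):
        # Rebuild the domain object from the record's payload and vector.
        return cls(id=record.id, content=record.payload["content"], embedding=record.vector)

    @classmethod
    def bulk_find(cls, limit=10, **kwargs):
        offset = kwargs.pop("offset", None)  # assumed cursor handling
        records, next_offset = cls.client.scroll(
            collection_name=cls.get_collection_name(),
            limit=limit,
            offset=offset,
        )
        return [cls.from_record(r) for r in records], next_offset


# Five fake records, read back two at a time.
SketchDocument.client = FakeScrollClient(
    [FakeRecord(id=i, payload={"content": f"doc {i}"}, vector=[float(i)]) for i in range(1, 6)]
)
first_page, cursor = SketchDocument.bulk_find(limit=2)
second_page, cursor_2 = SketchDocument.bulk_find(limit=2, offset=cursor)
last_page, cursor_3 = SketchDocument.bulk_find(limit=2, offset=cursor_2)
```

The cursor returned by one call is fed into the next, and the final page returns None as its cursor, which is the signal callers use to stop.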

Usage Example

from llm_engineering.domain.cleaned_documents import CleanedArticle

# Retrieve first batch of 50 cleaned articles
articles, next_offset = CleanedArticle.bulk_find(limit=50)

# Continue paginating
while next_offset is not None:
    more_articles, next_offset = CleanedArticle.bulk_find(limit=50, offset=next_offset)
    articles.extend(more_articles)

print(f"Retrieved {len(articles)} cleaned articles from feature store")
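
The loop above collects every page into one list. When the collection is large, a generator that yields documents page by page may be preferable. The following is a sketch under stated assumptions: iter_documents is a hypothetical helper (not part of the handbook's code), it accepts any callable with the same (limit, offset) -> (documents, next_offset) shape as bulk_find, and fake_bulk_find is a stand-in used only to make the example runnable.

```python
from typing import Any, Callable, Iterator


def iter_documents(fetch_page: Callable[..., tuple], limit: int = 50) -> Iterator[Any]:
    """Yield documents one page at a time until the cursor is None."""
    offset = None
    while True:
        # Omit the offset on the first call, mirroring the usage example.
        kwargs = {} if offset is None else {"offset": offset}
        documents, offset = fetch_page(limit=limit, **kwargs)
        yield from documents
        if offset is None:
            break


def fake_bulk_find(limit: int = 10, offset: int = 0) -> tuple:
    """Hypothetical stand-in for bulk_find backed by a list of seven items."""
    data = list(range(7))
    page = data[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(data) else None
    return page, next_offset


collected = list(iter_documents(fake_bulk_find, limit=3))
```

With a real document class, the call would look like iter_documents(CleanedArticle.bulk_find, limit=50), assuming bulk_find accepts the cursor as an offset keyword argument.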

External Dependencies

Dependency	Purpose
`qdrant_client`	Qdrant vector database client for scroll-based retrieval
`pydantic`	Base model for `VectorBaseDocument` serialization/deserialization

Design Notes

The method is a classmethod so that each subclass (e.g., CleanedArticle, CleanedPost) automatically resolves to its own Qdrant collection without additional configuration.
Using with_payload=True and with_vectors=True ensures full document reconstruction, which is necessary when the downstream pipeline needs both the text content and the embedding vectors.
The scroll API is preferred over search for exhaustive retrieval because it does not require a query vector and walks the collection page by page (ordered by point ID by default) until next_offset is None, so every point matching the filter is visited.
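
The first design note can be illustrated with a small sketch of how a classmethod on a base class picks up per-subclass settings. This is an illustration only: the collection_name class attribute, the describe_query helper, and the collection names "cleaned_articles" and "cleaned_posts" are assumptions made for the example and are not taken from the handbook's source.

```python
class SketchBaseDocument:
    """Simplified stand-in for a base document class."""

    collection_name = ""  # hypothetical per-subclass setting

    @classmethod
    def get_collection_name(cls) -> str:
        # cls is the class the method was called on, so subclasses
        # resolve to their own value without overriding this method.
        if not cls.collection_name:
            raise NotImplementedError(f"{cls.__name__} does not define a collection name")
        return cls.collection_name

    @classmethod
    def describe_query(cls, limit: int = 10) -> dict:
        # Hypothetical helper showing which arguments a shared
        # classmethod would build for the calling subclass.
        return {"collection_name": cls.get_collection_name(), "limit": limit}


class SketchCleanedArticle(SketchBaseDocument):
    collection_name = "cleaned_articles"  # hypothetical name


class SketchCleanedPost(SketchBaseDocument):
    collection_name = "cleaned_posts"  # hypothetical name


article_query = SketchCleanedArticle.describe_query(limit=50)
post_query = SketchCleanedPost.describe_query()
```

Because the shared method receives the calling class as cls, adding a new document type only requires declaring the subclass and its collection setting.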
