Implementation:PacktPublishing LLM Engineers Handbook VectorBaseDocument Bulk Find

From Leeroopedia
Revision as of 16:18, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/PacktPublishing_LLM_Engineers_Handbook_VectorBaseDocument_Bulk_Find.md)


Aspect Detail
API None
Source llm_engineering/domain/base/vector.py:L106-135
Type API Doc
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_Feature_Store_Query

Summary

The bulk_find class method on VectorBaseDocument provides paginated retrieval of cleaned documents from a Qdrant vector store collection. It wraps the Qdrant client's scroll API to return a batch of documents along with a pagination cursor (next_offset) for iterating through large result sets. This is the primary entry point for querying the feature store during dataset generation.

Source Code

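@classmethod
def bulk_find(cls, limit: int = 10, **kwargs) -> tuple[list, str | None]:
    # Simplified excerpt; see the source file for the full implementation.
    collection_name = cls.get_collection_name()
    qdrant_client = connection.get_qdrant_client()

    # Pagination cursor from a previous call (passed as offset=... in the
    # usage example); None starts from the beginning of the collection.
    offset = kwargs.pop("offset", None)

    # Uses the scroll API for paginated retrieval
    records, next_offset = qdrant_client.scroll(
        collection_name=collection_name,
        limit=limit,
        offset=offset,
        with_payload=True,
        with_vectors=True,
        scroll_filter=...,  # Optional filtering (elided in this excerpt)
    )

    return [cls.from_record(record) for record in records], next_offset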

Import

from llm_engineering.domain.base.vector import VectorBaseDocument

Parameters

Parameter Type Default Description
limit int 10 Maximum number of documents to return per batch (page size)
**kwargs varies -- Optional arguments: filter criteria for the Qdrant scroll filter (e.g., category filtering) and, as in the usage example, the offset cursor returned by a previous call
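
This page does not show how the keyword arguments are turned into a scroll filter. As a minimal sketch of one possible mapping, the helper below (build_scroll_filter is a hypothetical name, not part of the library or the handbook code) converts keyword arguments into a filter in Qdrant's JSON form, where every condition under "must" has to match. With the Python client, the same filter would normally be expressed through its Filter, FieldCondition and MatchValue models.

```python
from __future__ import annotations

from typing import Any


def build_scroll_filter(**field_values: Any) -> dict[str, Any] | None:
    """Hypothetical helper: map keyword arguments to a Qdrant JSON-style filter.

    Each keyword becomes an exact-match condition on a payload key, and all
    conditions are combined under "must" (logical AND).
    """
    if not field_values:
        return None  # no conditions: scroll the whole collection
    return {
        "must": [
            {"key": key, "match": {"value": value}}
            for key, value in field_values.items()
        ]
    }


# Assumed payload key "category"; the real payload schema may differ.
print(build_scroll_filter(category="articles"))
```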

Return Value

Component Type Description
Documents list[T] List of deserialized VectorBaseDocument subclass instances (e.g., CleanedArticle, CleanedPost, CleanedRepositoryDocument)
Next Offset str | None Pagination cursor for the next batch. None indicates no more results remain.

Behavior

  1. Resolves the Qdrant collection name from the class using cls.get_collection_name(). Each document subclass maps to its own collection.
  2. Obtains the Qdrant client instance via the connection module's singleton accessor.
  3. Calls qdrant_client.scroll() with:
    • collection_name -- the resolved collection
    • limit -- the requested batch size
    • with_payload=True -- includes the stored document data
    • with_vectors=True -- includes the embedding vectors
    • Optional scroll filter constructed from **kwargs
  4. Deserializes each Qdrant record into the appropriate domain object using cls.from_record(record).
  5. Returns the list of domain objects and the next_offset cursor.
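
The steps above can be sketched without a running Qdrant instance. Everything in the block below is a stand-in introduced for illustration: FakeRecord, FakeClient, BaseDocument and Article are hypothetical names, and FakeClient imitates only the part of scroll behaviour that matters here, namely that the cursor is the id of the first point of the next page. Filtering and error handling are left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class FakeRecord:
    """Stand-in for a stored point: id, payload and embedding vector."""
    id: int
    payload: dict[str, Any]
    vector: list[float]


class FakeClient:
    """In-memory stand-in that imitates only the scroll behaviour used here."""

    def __init__(self, collections: dict[str, list[FakeRecord]]) -> None:
        self._collections = collections

    def scroll(self, collection_name: str, limit: int = 10, offset: int | None = None):
        points = sorted(self._collections[collection_name], key=lambda p: p.id)
        if offset is not None:
            # The cursor is the id of the first point of the next page.
            points = [p for p in points if p.id >= offset]
        page, rest = points[:limit], points[limit:]
        return page, (rest[0].id if rest else None)


class BaseDocument:
    collection_name = ""
    client: FakeClient  # assigned below; shared by all subclasses

    def __init__(self, id: int, content: str, embedding: list[float]) -> None:
        self.id, self.content, self.embedding = id, content, embedding

    @classmethod
    def get_collection_name(cls) -> str:
        return cls.collection_name

    @classmethod
    def from_record(cls, record: FakeRecord) -> "BaseDocument":
        return cls(record.id, record.payload["content"], record.vector)

    @classmethod
    def bulk_find(cls, limit: int = 10, **kwargs: Any):
        # Steps 1-3: resolve the collection, get the client, scroll one page.
        records, next_offset = cls.client.scroll(
            collection_name=cls.get_collection_name(),
            limit=limit,
            offset=kwargs.pop("offset", None),
        )
        # Steps 4-5: deserialize each record, return the page and its cursor.
        return [cls.from_record(r) for r in records], next_offset


class Article(BaseDocument):
    collection_name = "articles"


BaseDocument.client = FakeClient(
    {"articles": [FakeRecord(i, {"content": f"text {i}"}, [0.0, float(i)]) for i in range(1, 6)]}
)

page, cursor = Article.bulk_find(limit=2)
print([doc.id for doc in page], cursor)  # [1, 2] 3
```

Because bulk_find is called on Article, cls resolves to Article in both get_collection_name() and from_record(), so the subclass needs nothing beyond its collection name.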

Usage Example

from llm_engineering.domain.cleaned_documents import CleanedArticle

# Retrieve first batch of 50 cleaned articles
articles, next_offset = CleanedArticle.bulk_find(limit=50)

# Continue paginating
while next_offset is not None:
    more_articles, next_offset = CleanedArticle.bulk_find(limit=50, offset=next_offset)
    articles.extend(more_articles)

print(f"Retrieved {len(articles)} cleaned articles from feature store")
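
When the full result set does not need to be held in one list, the same loop can be wrapped in a generator. The sketch below assumes only the contract described on this page: bulk_find returns a page plus a cursor, accepts that cursor back as offset, and signals the end with None. iter_documents and StubDocument are hypothetical names; StubDocument is a list-backed stand-in so the example runs without Qdrant.

```python
from __future__ import annotations

from collections.abc import Iterator
from typing import Any


def iter_documents(document_cls: Any, batch_size: int = 50) -> Iterator[Any]:
    """Yield every document from document_cls.bulk_find, one page at a time."""
    offset = None
    while True:
        # Only pass offset once a cursor exists, mirroring the first call above.
        kwargs = {} if offset is None else {"offset": offset}
        documents, offset = document_cls.bulk_find(limit=batch_size, **kwargs)
        yield from documents
        if offset is None:
            break


class StubDocument:
    """Hypothetical stand-in with the same bulk_find contract, backed by a list."""

    _items = list(range(7))

    @classmethod
    def bulk_find(cls, limit: int = 10, offset: int | None = None):
        start = offset or 0
        end = start + limit
        next_offset = end if end < len(cls._items) else None
        return cls._items[start:end], next_offset


print(list(iter_documents(StubDocument, batch_size=3)))  # [0, 1, 2, 3, 4, 5, 6]
```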

External Dependencies

Dependency Purpose
qdrant_client Qdrant vector database client for scroll-based retrieval
pydantic Base model for VectorBaseDocument serialization/deserialization

Design Notes

  • The method is a classmethod so that each subclass (e.g., CleanedArticle, CleanedPost) automatically resolves to its own Qdrant collection without additional configuration.
  • Using with_payload=True and with_vectors=True ensures full document reconstruction, which is necessary when the downstream pipeline needs both the text content and the embedding vectors.
  • The scroll API is preferred over search for exhaustive retrieval because it does not require a query vector; following next_offset until it is None visits every point in the collection that matches the filter.
