Implementation:Ucbepic Docetl LanceDBRetriever

From Leeroopedia


Knowledge Sources
Domains Data_Processing, Information_Retrieval, RAG
Last Updated 2026-02-08 00:00 GMT

Overview

A concrete DocETL tool: a LanceDB-backed retriever that supports full-text search, vector search, and hybrid retrieval for RAG operations.

Description

The LanceDBRetriever class extends the base Retriever to provide retrieval-augmented generation (RAG) capabilities backed by the LanceDB columnar database. It supports three search modes: full-text search (FTS), embedding-based vector search, and hybrid search combining both with RRF (Reciprocal Rank Fusion) reranking. The retriever lazily builds and manages indexes based on configurable build policies ("if_missing", "always", "never"), renders query and index phrases using Jinja2 templates, and batches embedding generation through the DocETL API. Thread-safe index construction is ensured via a class-level lock.
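To make the hybrid mode more concrete, the following is a minimal sketch of Reciprocal Rank Fusion over two ranked result lists. It is not the retriever's own code: the function name `rrf_fuse`, the document ids, and the constant `k = 60` (a commonly used default) are assumptions for illustration, and the actual reranking may be delegated to LanceDB.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_k: int = 5
) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores 1 / (k + rank) per list it appears in; the
    per-list scores are summed. `k` and `top_k` are illustrative defaults.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical result lists from a full-text query and a vector query.
fts_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([fts_hits, vector_hits], top_k=3)
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise above documents that appear in only one, which is the behaviour hybrid search relies on.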

Usage

Use this retriever in DocETL pipelines that require RAG-style retrieval, such as resolve or map operations that need to look up related documents before processing. Configure it in the YAML pipeline under a retrievers section with LanceDB-specific options.

Code Reference

Source Location

Signature

class LanceDBRetriever(Retriever):
    _index_lock = threading.Lock()
    _ensured = False

    def _connect(self): ...
    def _table_name(self) -> str: ...
    def _iter_dataset_rows(self) -> list[dict]: ...
    def _render_input_phrase(self, tmpl: str | None, input_obj: dict) -> str: ...
    def _batch_embed(self, texts: list[str]) -> list[list[float]]: ...
    def _index_types(self) -> set[str]: ...
    def ensure_index(self) -> None: ...
    def _render_query_phrase(self, tmpl: str | None, context: dict[str, Any]) -> str: ...
    def _select_mode(self) -> str: ...
    def _reranker(self): ...
    def _limit_and_format(self, rows: list[dict]) -> list[dict]: ...
    def _fetch(self, context: dict[str, Any]) -> list[dict]: ...
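The `_index_lock` and `_ensured` attributes in the signature suggest a guarded, build-once pattern. The sketch below shows one way such a pattern can work under the three build policies; `LazyIndexBuilder`, `index_exists`, and `build_count` are hypothetical stand-ins, not part of DocETL, and the real `ensure_index` may differ in detail.

```python
import threading


class LazyIndexBuilder:
    """Hypothetical stand-in for the retriever's index management."""

    _index_lock = threading.Lock()  # class-level: shared by all instances
    _ensured = False

    def __init__(self, build_policy: str = "if_missing") -> None:
        if build_policy not in {"if_missing", "always", "never"}:
            raise ValueError(f"unknown build policy: {build_policy!r}")
        self.build_policy = build_policy
        self.index_exists = False  # stands in for a check against the index dir
        self.build_count = 0

    def ensure_index(self) -> None:
        if self._ensured:  # fast path once the index has been handled
            return
        with self._index_lock:
            if self._ensured:  # re-check after acquiring the lock
                return
            must_build = self.build_policy == "always" or (
                self.build_policy == "if_missing" and not self.index_exists
            )
            if must_build:
                self._build()
            self._ensured = True

    def _build(self) -> None:
        # Placeholder for rendering index phrases, embedding, and writing rows.
        self.index_exists = True
        self.build_count += 1


builder = LazyIndexBuilder("if_missing")
threads = [threading.Thread(target=builder.ensure_index) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Even with eight concurrent callers, the build step runs once; with the `"never"` policy it does not run at all.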

Import

from docetl.retrievers.lancedb import LanceDBRetriever

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| config["index_dir"] | str | Yes | Directory path for LanceDB index storage |
| config["dataset"] | str | Yes | Name of the DocETL dataset to index |
| config["index_types"] | list[str] | No | Index types to build: "fts", "embedding", or "hybrid" |
| config["build_index"] | str | No | Build policy: "if_missing" (default), "always", or "never" |
| config["fts"]["index_phrase"] | str | No | Jinja2 template for FTS index text per row |
| config["fts"]["query_phrase"] | str | No | Jinja2 template for FTS query text |
| config["embedding"]["model"] | str | No | Embedding model name for vector search |
| config["embedding"]["index_phrase"] | str | No | Jinja2 template for embedding index text |
| config["embedding"]["query_phrase"] | str | No | Jinja2 template for embedding query text |
| config["query"]["mode"] | str | No | Query mode: "fts", "embedding", or "hybrid" |
| config["query"]["top_k"] | int | No | Number of results to return (default: 5) |
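The inputs listed above can be mirrored as a plain Python dict. The sketch below checks the two required keys and applies the documented defaults (`"if_missing"` and `5`); the helper `resolve_options` is hypothetical and only illustrates how the options relate, not how DocETL validates them.

```python
from typing import Any

# Example configuration using the keys from the inputs table.
config: dict[str, Any] = {
    "type": "lancedb",
    "index_dir": "./lancedb_index",
    "dataset": "documents",
    "index_types": ["fts", "embedding"],
    "fts": {"index_phrase": "{{ input.content }}"},
    "embedding": {"model": "text-embedding-3-small"},
    "query": {"mode": "hybrid"},
}


def resolve_options(cfg: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: check required keys and fill documented defaults."""
    missing = [key for key in ("index_dir", "dataset") if key not in cfg]
    if missing:
        raise ValueError(f"missing required retriever options: {missing}")
    query = cfg.get("query", {})
    return {
        "build_index": cfg.get("build_index", "if_missing"),  # documented default
        "mode": query.get("mode"),
        "top_k": query.get("top_k", 5),  # documented default
    }


options = resolve_options(config)
```

Here `build_index` and `top_k` are omitted from the config, so the resolved options fall back to `"if_missing"` and `5`.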

Outputs

| Name | Type | Description |
|------|------|-------------|
| results | list[dict] | Retrieved documents matching the query, limited to top_k |

Usage Examples

# YAML configuration for a LanceDB retriever in a pipeline:
# retrievers:
#   knowledge_base:
#     type: lancedb
#     index_dir: ./lancedb_index
#     dataset: documents
#     index_types: [fts, embedding]
#     build_index: if_missing
#     embedding:
#       model: text-embedding-3-small
#       index_phrase: "{{ input.title }} {{ input.content }}"
#       query_phrase: "{{ input.query }}"
#     fts:
#       index_phrase: "{{ input.content }}"
#       query_phrase: "{{ input.search_term }}"
#     query:
#       mode: hybrid
#       top_k: 10

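from docetl.retrievers.lancedb import LanceDBRetriever

# The retriever is typically instantiated by the DSLRunner from the YAML config;
# `config` (the parsed retriever mapping) and `runner` must already exist.
retriever = LanceDBRetriever(name="knowledge_base", config=config, runner=runner)
retriever.ensure_index()  # builds or reuses the index per the build_index policy

# The templates above read `input.query` and `input.search_term`, so the context
# is assumed to carry those fields under an `input` key.
context = {"input": {"query": "contract liability clauses", "search_term": "liability"}}
results = retriever._fetch(context)  # private method; list of up to top_k dicts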
