Implementation:Ucbepic Docetl LanceDBRetriever
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Information_Retrieval, RAG |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A LanceDB-backed retriever, provided by DocETL, that supports full-text search, vector search, and hybrid retrieval for RAG operations.
Description
The LanceDBRetriever class extends the base Retriever to provide retrieval-augmented generation (RAG) capabilities backed by the LanceDB columnar database. It supports three search modes: full-text search (FTS), embedding-based vector search, and hybrid search combining both with RRF (Reciprocal Rank Fusion) reranking. The retriever lazily builds and manages indexes based on configurable build policies ("if_missing", "always", "never"), renders query and index phrases using Jinja2 templates, and batches embedding generation through the DocETL API. Thread-safe index construction is ensured via a class-level lock.
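To make the hybrid mode more concrete, the following is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of document ids. It is not DocETL's or LanceDB's code: the function name `reciprocal_rank_fusion`, the sample ids, and the smoothing constant `k=60` (a conventional default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top hit. `k` is an assumed constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two search modes.
fts_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([fts_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise above documents that appear in only one, which is the property that makes RRF a common choice for combining keyword and vector results without calibrating their raw scores.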
Usage
Use this retriever in DocETL pipelines that require RAG-style retrieval, such as resolve or map operations that need to look up related documents before processing. Configure it in the YAML pipeline under a retrievers section with LanceDB-specific options.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/retrievers/lancedb.py
- Lines: 1-358
Signature
class LanceDBRetriever(Retriever):
    _index_lock = threading.Lock()
    _ensured = False

    def _connect(self): ...
    def _table_name(self) -> str: ...
    def _iter_dataset_rows(self) -> list[dict]: ...
    def _render_input_phrase(self, tmpl: str | None, input_obj: dict) -> str: ...
    def _batch_embed(self, texts: list[str]) -> list[list[float]]: ...
    def _index_types(self) -> set[str]: ...
    def ensure_index(self) -> None: ...
    def _render_query_phrase(self, tmpl: str | None, context: dict[str, Any]) -> str: ...
    def _select_mode(self) -> str: ...
    def _reranker(self): ...
    def _limit_and_format(self, rows: list[dict]) -> list[dict]: ...
    def _fetch(self, context: dict[str, Any]) -> list[dict]: ...
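The `_batch_embed` signature takes a list of texts and returns one vector per text. A rough sketch of how such batching might work is shown below. The helper `batch_embed`, the `embed_fn` callable, the `batch_size` parameter, and the fake embedding function are all hypothetical; the actual method calls the DocETL API and may chunk its requests differently.

```python
def batch_embed(texts, embed_fn, batch_size=128):
    """Embed `texts` in fixed-size batches, preserving input order.

    `embed_fn` is a hypothetical callable that maps a list of strings
    to a list of vectors of the same length.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors


calls = []  # records the size of each batch sent to the fake model


def fake_embed(batch):
    """Stand-in for a real embedding call: one 1-d vector per text."""
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


embedded = batch_embed(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(embedded, calls)
```

Batching keeps the number of API round trips proportional to `len(texts) / batch_size` rather than to the number of rows, which matters when indexing a whole dataset.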
Import
from docetl.retrievers.lancedb import LanceDBRetriever
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config["index_dir"] | str | Yes | Directory path for LanceDB index storage |
| config["dataset"] | str | Yes | Name of the DocETL dataset to index |
| config["index_types"] | list[str] | No | Index types to build: "fts", "embedding", or "hybrid" |
| config["build_index"] | str | No | Build policy: "if_missing" (default), "always", or "never" |
| config["fts"]["index_phrase"] | str | No | Jinja2 template for FTS index text per row |
| config["fts"]["query_phrase"] | str | No | Jinja2 template for FTS query text |
| config["embedding"]["model"] | str | No | Embedding model name for vector search |
| config["embedding"]["index_phrase"] | str | No | Jinja2 template for embedding index text |
| config["embedding"]["query_phrase"] | str | No | Jinja2 template for embedding query text |
| config["query"]["mode"] | str | No | Query mode: "fts", "embedding", or "hybrid" |
| config["query"]["top_k"] | int | No | Number of results to return (default: 5) |
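The interaction between the `build_index` policy and the class-level lock can be sketched as follows. This is an illustrative stand-in, not the DocETL implementation: the class `LazyIndexBuilder`, its `index_exists` and `build_count` attributes, and the choice to raise an error when the policy is "never" and no index exists are assumptions.

```python
import threading


class LazyIndexBuilder:
    """Hypothetical stand-in showing lazy, lock-guarded index building."""

    _index_lock = threading.Lock()  # shared by every instance of the class

    def __init__(self, build_policy="if_missing"):
        if build_policy not in {"if_missing", "always", "never"}:
            raise ValueError(f"unknown build policy: {build_policy!r}")
        self.build_policy = build_policy
        self._ensured = False
        self.index_exists = False  # stand-in for "the table is on disk"
        self.build_count = 0

    def _build(self):
        # A real retriever would write the FTS and/or vector index here.
        self.build_count += 1
        self.index_exists = True

    def ensure_index(self):
        if self._ensured:  # fast path once the index has been checked
            return
        with self._index_lock:
            if self._ensured:  # re-check after acquiring the lock
                return
            if self.build_policy == "always":
                self._build()
            elif self.build_policy == "if_missing" and not self.index_exists:
                self._build()
            elif self.build_policy == "never" and not self.index_exists:
                # Assumed behaviour: fail loudly rather than build.
                raise RuntimeError("index is missing and build policy is 'never'")
            self._ensured = True


builder = LazyIndexBuilder("if_missing")
threads = [threading.Thread(target=builder.ensure_index) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(builder.build_count)
```

Even with eight concurrent callers, the build runs once, because the second `_ensured` check happens while holding the lock.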
Outputs
| Name | Type | Description |
|---|---|---|
| results | list[dict] | Retrieved documents matching the query, limited to top_k |
Usage Examples
# YAML configuration for a LanceDB retriever in a pipeline:
# retrievers:
#   knowledge_base:
#     type: lancedb
#     index_dir: ./lancedb_index
#     dataset: documents
#     index_types: [fts, embedding]
#     build_index: if_missing
#     embedding:
#       model: text-embedding-3-small
#       index_phrase: "{{ input.title }} {{ input.content }}"
#       query_phrase: "{{ input.query }}"
#     fts:
#       index_phrase: "{{ input.content }}"
#       query_phrase: "{{ input.search_term }}"
#     query:
#       mode: hybrid
#       top_k: 10
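from docetl.retrievers.lancedb import LanceDBRetriever
# The retriever is typically instantiated by the DSLRunner from YAML config.
# `config` is the parsed retriever mapping shown above; `runner` is the active DSLRunner.
retriever = LanceDBRetriever(name="knowledge_base", config=config, runner=runner)
retriever.ensure_index()
# _fetch is an internal method, called directly here only for illustration.
# The context is assumed to expose the `input` object that the query_phrase templates reference.
results = retriever._fetch({"input": {"query": "contract liability clauses", "search_term": "liability"}})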