Principle:Lance format Lance Data Scanning And Reading
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Columnar_Storage |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Data scanning and reading is the process of querying a Lance dataset to retrieve rows as Arrow RecordBatches, with support for projection, filtering, limiting, vector search, and full-text search.
Description
Lance provides a builder-based Scanner API for constructing read queries against a dataset. The scanner supports a rich set of operations that are lazily composed and then executed as an optimized physical plan:
- Projection: Select a subset of columns, including nested struct fields and computed expressions.
- Filtering: Apply SQL-style predicates to skip non-matching rows, with optional scalar index acceleration.
- Limiting and Offset: Restrict the number of returned rows, with optional offset for pagination.
- Vector Nearest-Neighbor Search: Find the k closest vectors to a query vector using ANN indices.
- Full-Text Search: Perform BM25-scored text search on string columns with inverted indices.
- Ordering: Sort results by one or more columns.
- Row ID / Row Address Access: Include internal identifiers for downstream operations like updates or deletes.
In addition to streaming scans, Lance supports point lookups through take (by row index) and take_rows (by internal row ID), which provide O(1) random access to individual rows.
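The lazy, builder-style composition described above can be illustrated with a small toy model. This is a sketch in plain Python, not the Lance API: `ToyScanner`, its methods, and the `take` helper are hypothetical names invented for illustration, and the "dataset" is just a list of dicts. It shows that building the query reads nothing, and that projection, filter, limit, and offset are applied together only when the scan is executed.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class ToyScanner:
    """Toy stand-in for a builder-style scanner (hypothetical, NOT the Lance API)."""
    rows: list
    _columns: Optional[list] = None
    _predicate: Optional[Callable[[dict], bool]] = None
    _limit: Optional[int] = None
    _offset: int = 0

    def project(self, columns):
        self._columns = list(columns)
        return self

    def filter(self, predicate):
        self._predicate = predicate
        return self

    def limit(self, limit, offset=0):
        self._limit, self._offset = limit, offset
        return self

    def execute(self) -> Iterator[dict]:
        # Nothing is read until execute() is iterated.
        matched = emitted = 0
        for row in self.rows:
            if self._predicate is not None and not self._predicate(row):
                continue
            matched += 1
            if matched <= self._offset:
                continue  # skip the first `offset` matching rows
            if self._limit is not None and emitted >= self._limit:
                break
            emitted += 1
            if self._columns is None:
                yield dict(row)
            else:
                yield {c: row[c] for c in self._columns}


def take(rows, indices, columns=None):
    """Point lookup by row index, in the spirit of take (toy version)."""
    return [{c: rows[i][c] for c in (columns or rows[i].keys())} for i in indices]


rows = [{"id": i, "label": "cat" if i % 2 == 0 else "dog"} for i in range(10)]

# Second page of size 2 over the rows labelled "cat".
page = list(
    ToyScanner(rows)
    .project(["id", "label"])
    .filter(lambda r: r["label"] == "cat")
    .limit(2, offset=1)
    .execute()
)
picked = take(rows, [7, 3], columns=["id"])
```

In this toy, the offset counts rows that pass the filter, which is the behavior pagination usually needs. `take` indexes the list directly instead of scanning, which is the point-lookup idea in miniature.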
Usage
Use scanning and reading when:
- Loading training data for ML models with column projection to reduce I/O.
- Filtering datasets by metadata predicates before vectorized processing.
- Performing similarity search over embedding columns.
- Paginating through large datasets with limit and offset.
- Retrieving specific rows by their index or row ID for inference or debugging.
Theoretical Basis
The Lance scanner translates a logical query plan into a physical execution plan optimized for columnar I/O:
Scan Planning
- Column pruning: Only the columns referenced in the projection and filter are read from storage. This is critical for wide tables where only a few columns are needed.
- Predicate pushdown: Filters are pushed down to the fragment level. Lance uses zone maps (min/max statistics per row group) to skip entire row groups that cannot contain matching rows. When scalar indices (BTree, inverted) are available, they are used for index-based filtering.
- Late materialization: For selective queries (those that return a small fraction of rows), Lance can first identify matching row addresses using only the filter columns, then fetch the projected columns only for matching rows. This is controlled by the MaterializationStyle setting.
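The interaction between zone-map pruning and late materialization can be sketched with a toy in-memory column store. This is an illustrative model under simplifying assumptions, not Lance's implementation: `Zone`, `matching_addresses`, and `late_materialize` are hypothetical names, zones are fixed runs of four rows, and the predicate is a simple inclusive range.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """A contiguous run of rows of the filter column, with min/max statistics."""
    start: int    # row address of the first row in the zone
    values: list  # values of the filter column for this zone

    @property
    def min(self):
        return min(self.values)

    @property
    def max(self):
        return max(self.values)


def matching_addresses(zones, lo, hi):
    """Return (addresses, zones_read) for rows where lo <= value <= hi."""
    addresses, zones_read = [], 0
    for zone in zones:
        if zone.max < lo or zone.min > hi:
            continue  # statistics prove no row in this zone can match
        zones_read += 1
        for i, value in enumerate(zone.values):
            if lo <= value <= hi:
                addresses.append(zone.start + i)
    return addresses, zones_read


def late_materialize(addresses, column_store, columns):
    """Fetch the projected columns only for rows that passed the filter."""
    return [{c: column_store[c][a] for c in columns} for a in addresses]


ages = [18, 22, 25, 31, 35, 38, 41, 47, 52, 60, 64, 70]
column_store = {"age": ages, "name": [f"user{i}" for i in range(len(ages))]}
zones = [Zone(start=s, values=ages[s:s + 4]) for s in range(0, len(ages), 4)]

# Step 1: evaluate the filter using only the filter column and its statistics.
addresses, zones_read = matching_addresses(zones, lo=36, hi=45)
# Step 2: read the projected column only at the matching addresses.
result = late_materialize(addresses, column_store, ["name"])
```

Only one of the three zones is read, and the `name` column is touched for two rows rather than twelve. The saving grows with the width of the projected columns and the selectivity of the filter.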
Vector Search
For nearest-neighbor queries, Lance executes a hybrid plan:
- If prefilter is enabled, the SQL filter is applied first to produce a candidate set, and the vector index search is restricted to that set. This ensures that up to k matching rows are returned whenever that many exist, but may be slower.
- If prefilter is disabled (default), the vector index returns the top-k candidates, and the filter is applied post-hoc. This is faster but may return fewer than k results.
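The difference between the two orderings can be seen in a minimal sketch. It swaps the ANN index for an exact brute-force search over a handful of 2-D vectors, so it illustrates only the ordering of filter and top-k, not index behavior. The function names and sample data are hypothetical.

```python
import math


def top_k(query, candidates, k):
    """Exact nearest neighbours by L2 distance (stand-in for an ANN index)."""
    return sorted(candidates, key=lambda r: math.dist(query, r["vec"]))[:k]


def search_prefilter(rows, query, k, predicate):
    # Filter first, then search only within the matching candidates.
    return top_k(query, [r for r in rows if predicate(r)], k)


def search_postfilter(rows, query, k, predicate):
    # Search first, then drop the top-k hits that fail the filter.
    return [r for r in top_k(query, rows, k) if predicate(r)]


rows = [
    {"id": 0, "vec": (0.0, 0.0), "lang": "en"},
    {"id": 1, "vec": (0.1, 0.0), "lang": "fr"},
    {"id": 2, "vec": (0.2, 0.0), "lang": "fr"},
    {"id": 3, "vec": (0.9, 0.0), "lang": "en"},
    {"id": 4, "vec": (1.5, 0.0), "lang": "en"},
]


def is_english(row):
    return row["lang"] == "en"


pre = search_prefilter(rows, (0.0, 0.0), 2, is_english)
post = search_postfilter(rows, (0.0, 0.0), 2, is_english)
```

With k = 2, the prefiltered search returns rows 0 and 3. The postfiltered search returns only row 0, because one of the two nearest vectors fails the filter and is discarded after the search.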
Execution Model
The scanner produces a DatasetRecordBatchStream, which is a tokio-based async stream of Arrow RecordBatches. Batches are read concurrently from multiple fragments, with configurable batch size, batch readahead, and I/O buffer size to balance throughput and memory usage.
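The role of batch size and readahead can be approximated with an asyncio analogue. This is a rough sketch in Python rather than tokio, and it does not use any Lance API: `read_fragment` and `scan` are hypothetical helpers, fragments are plain lists, and the I/O wait is simulated. It keeps at most `readahead` fragment reads in flight while yielding batches in fragment order.

```python
import asyncio
from collections import deque


async def read_fragment(fragment, batch_size):
    """Pretend to read one fragment from storage and split it into batches."""
    await asyncio.sleep(0)  # stand-in for an I/O wait
    return [fragment[i:i + batch_size] for i in range(0, len(fragment), batch_size)]


async def scan(fragments, batch_size, readahead):
    """Yield batches in fragment order with at most `readahead` reads in flight."""
    in_flight = deque()
    remaining = iter(fragments)
    exhausted = False
    while True:
        # Top up the window of concurrent fragment reads.
        while not exhausted and len(in_flight) < readahead:
            fragment = next(remaining, None)
            if fragment is None:
                exhausted = True
            else:
                in_flight.append(asyncio.create_task(read_fragment(fragment, batch_size)))
        if not in_flight:
            return
        # Await the oldest read so output order matches fragment order.
        for batch in await in_flight.popleft():
            yield batch


async def collect(fragments, batch_size, readahead):
    return [batch async for batch in scan(fragments, batch_size, readahead)]


fragments = [list(range(0, 5)), list(range(5, 8)), list(range(8, 12))]
batches = asyncio.run(collect(fragments, batch_size=2, readahead=2))
```

A larger `readahead` overlaps more I/O at the cost of holding more decoded data in memory, which is the throughput versus memory trade-off the configurable settings expose. Unlike a real scanner, this toy never merges rows across fragment boundaries, so the last batch of a fragment can be short.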