Principle:Lance format Lance Data Scanning And Reading

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Columnar_Storage
Last Updated 2026-02-08 19:00 GMT

Overview

Data scanning and reading is the process of querying a Lance dataset to retrieve rows as Arrow RecordBatches, with support for projection, filtering, limiting, vector search, and full-text search.

Description

Lance provides a builder-based Scanner API for constructing read queries against a dataset. The scanner supports a rich set of operations that are lazily composed and then executed as an optimized physical plan:

  • Projection: Select a subset of columns, including nested struct fields and computed expressions.
  • Filtering: Apply SQL-style predicates to skip non-matching rows, with optional scalar index acceleration.
  • Limiting and Offset: Restrict the number of returned rows, with optional offset for pagination.
  • Vector Nearest-Neighbor Search: Find the k closest vectors to a query vector using ANN indices.
  • Full-Text Search: Perform BM25-scored text search on string columns with inverted indices.
  • Ordering: Sort results by one or more columns.
  • Row ID / Row Address Access: Include internal identifiers for downstream operations like updates or deletes.
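The builder pattern described above can be illustrated with a small pure-Python model. This is a sketch only: `ToyScanner` and its method names are hypothetical stand-ins chosen for illustration and are not Lance's Scanner API. It shows the idea that each call merely records an option and that no rows are touched until the query is executed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyScanner:
    """Hypothetical builder-style scanner over a list of dict rows (not Lance's API)."""
    rows: list
    _columns: Optional[list] = None
    _predicate: Optional[Callable[[dict], bool]] = None
    _limit: Optional[int] = None
    _offset: int = 0

    def project(self, columns):
        self._columns = list(columns)
        return self  # returning self lets calls be chained

    def filter(self, predicate):
        self._predicate = predicate
        return self

    def limit(self, limit, offset=0):
        self._limit, self._offset = limit, offset
        return self

    def execute(self):
        """Apply the recorded options; nothing is evaluated before this call."""
        out = self.rows
        if self._predicate is not None:
            out = [r for r in out if self._predicate(r)]
        out = out[self._offset:]
        if self._limit is not None:
            out = out[:self._limit]
        if self._columns is not None:
            out = [{c: r[c] for c in self._columns} for r in out]
        return out


rows = [{"id": i, "label": "cat" if i % 2 else "dog", "score": i / 10} for i in range(10)]

# Second page of size 2 over the "cat" rows, keeping only two columns.
page = (
    ToyScanner(rows)
    .project(["id", "score"])
    .filter(lambda r: r["label"] == "cat")
    .limit(2, offset=1)
    .execute()
)
print(page)
```

Note that the filter in this sketch reads the `label` column even though it is not in the projection, which mirrors the point that a scan must read both projected and filtered columns.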

In addition to streaming scans, Lance supports point lookups through take (by row index) and take_rows (by internal row ID), which fetch individual rows by random access rather than by scanning the rows that precede them.
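As a rough illustration of lookup by row index, the sketch below resolves a dataset-wide index to a fragment and a local offset, then reads only that position. The function name `take` is borrowed from the text, but the list-of-lists layout and the binary search over fragment sizes are assumptions made for this example, not a description of Lance's internals.

```python
from bisect import bisect_right
from itertools import accumulate


def take(fragments, indices):
    """Fetch rows by dataset-wide row index without scanning earlier rows."""
    # ends[i] is the total number of rows in fragments[0..i].
    ends = list(accumulate(len(f) for f in fragments))
    out = []
    for index in indices:
        if not 0 <= index < ends[-1]:
            raise IndexError(index)
        frag_no = bisect_right(ends, index)           # fragment holding this row
        start = ends[frag_no - 1] if frag_no else 0   # first row index of that fragment
        out.append(fragments[frag_no][index - start])
    return out


fragments = [["a", "b", "c"], ["d", "e"], ["f"]]
picked = take(fragments, [4, 0, 5])
print(picked)
```

Results come back in the order the indices were requested, which is the behavior one usually wants when mapping lookups back to the caller's inputs.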

Usage

Use scanning and reading for tasks such as:

  • Loading training data for ML models with column projection to reduce I/O.
  • Filtering datasets by metadata predicates before vectorized processing.
  • Performing similarity search over embedding columns.
  • Paginating through large datasets with limit and offset.
  • Retrieving specific rows by their index or row ID for inference or debugging.

Theoretical Basis

The Lance scanner translates a logical query plan into a physical execution plan optimized for columnar I/O:

Scan Planning

  1. Column pruning: Only the columns referenced in the projection and filter are read from storage. This is critical for wide tables where only a few columns are needed.
  2. Predicate pushdown: Filters are pushed down to the fragment level. Lance uses zone maps (min/max statistics per row group) to skip entire row groups that cannot contain matching rows. When scalar indices (BTree, inverted) are available, they are used for index-based filtering.
  3. Late materialization: For selective queries (those that return a small fraction of rows), Lance can first identify matching row addresses using only the filter columns, then fetch the projected columns only for matching rows. This is controlled by the MaterializationStyle setting.
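The three planning steps above can be sketched with a toy columnar layout. Everything here is hypothetical: the "zone" dictionaries, the `build_zones` and `scan_age_greater_than` helpers, and the sample data are illustrative assumptions, not Lance's storage format. The sketch reads the filter column only in zones whose min/max statistics allow a match, and fetches the projected column only for rows that passed the filter.

```python
def build_zones(ages, names, zone_size):
    """Split two columns into zones and record min/max of the filter column."""
    zones = []
    for start in range(0, len(ages), zone_size):
        chunk = ages[start:start + zone_size]
        zones.append({
            "age": chunk,
            "name": names[start:start + zone_size],
            "min": min(chunk),
            "max": max(chunk),
        })
    return zones


def scan_age_greater_than(zones, threshold):
    """Return names of rows with age > threshold and how many zones were read."""
    zones_read = 0
    matches = []  # (zone, offset) pairs, standing in for row addresses
    for zone in zones:
        # Predicate pushdown: the statistics prove no row in this zone matches.
        if zone["max"] <= threshold:
            continue
        zones_read += 1
        # Column pruning: only the filter column is inspected here.
        for offset, age in enumerate(zone["age"]):
            if age > threshold:
                matches.append((zone, offset))
    # Late materialization: fetch the projected column for matching rows only.
    names = [zone["name"][offset] for zone, offset in matches]
    return names, zones_read


ages = [21, 25, 23, 30, 34, 31, 62, 58, 65]
zones = build_zones(ages, list("abcdefghi"), zone_size=3)
names, zones_read = scan_age_greater_than(zones, threshold=60)
print(names, zones_read)
```

In this sample data two of the three zones are skipped from their statistics alone, and the `name` column is touched for just two rows.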

Vector Search

For nearest-neighbor queries that also carry a filter, Lance combines the filter and the vector index in one of two ways:

  1. If prefilter is enabled, the SQL filter is applied first to produce a candidate set, and the vector index search is restricted to that set. Every returned row satisfies the filter and up to k such rows can be found, but the search may be slower.
  2. If prefilter is disabled (default), the vector index returns the top-k candidates, and the filter is applied post-hoc. This is faster but may return fewer than k results, even when more matching rows exist in the dataset.
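The difference between the two orderings can be seen in a minimal sketch. It uses brute-force Euclidean distance as a stand-in for an ANN index, and the `search` function, its `prefilter` flag, and the sample rows are assumptions made for illustration rather than Lance's implementation.

```python
import math


def top_k(query, candidates, k):
    """Exact k nearest rows by Euclidean distance (stand-in for an ANN index)."""
    return sorted(candidates, key=lambda row: math.dist(query, row["vec"]))[:k]


def search(rows, query, k, predicate, prefilter):
    if prefilter:
        # Filter first, then search only among rows that satisfy the predicate.
        return top_k(query, [r for r in rows if predicate(r)], k)
    # Search first, then drop top-k candidates that fail the predicate.
    return [r for r in top_k(query, rows, k) if predicate(r)]


rows = [
    {"id": 0, "vec": (0.0, 0.0), "lang": "en"},
    {"id": 1, "vec": (0.1, 0.0), "lang": "fr"},
    {"id": 2, "vec": (0.2, 0.0), "lang": "fr"},
    {"id": 3, "vec": (5.0, 5.0), "lang": "en"},
]
query = (0.0, 0.0)


def is_english(row):
    return row["lang"] == "en"


pre_ids = [r["id"] for r in search(rows, query, 2, is_english, prefilter=True)]
post_ids = [r["id"] for r in search(rows, query, 2, is_english, prefilter=False)]
print(pre_ids, post_ids)
```

With k = 2, prefiltering finds both English rows, while postfiltering keeps only one because the second-nearest vector overall fails the filter.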

Execution Model

The scanner produces a DatasetRecordBatchStream, which is a tokio-based async stream of Arrow RecordBatches. Batches are read concurrently from multiple fragments, with configurable batch size, batch readahead, and I/O buffer size to balance throughput and memory usage.
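The shape of this execution model can be approximated in Python with asyncio. This is only an analogy: the real stream is built on tokio in Rust, and the `read_fragment` and `scan` helpers, the list-based fragments, and the rule that batches never span fragments are simplifying assumptions. The sketch keeps up to `readahead` fragment reads in flight and yields fixed-size batches in fragment order.

```python
import asyncio
from collections import deque


async def read_fragment(fragment, batch_size):
    """Simulate reading one fragment and splitting it into batches."""
    await asyncio.sleep(0)  # stand-in for waiting on storage I/O
    return [fragment[i:i + batch_size] for i in range(0, len(fragment), batch_size)]


async def scan(fragments, batch_size, readahead):
    """Yield batches in order; readahead must be at least 1."""
    remaining = iter(fragments)
    pending = deque()  # in-flight fragment reads, oldest first

    def top_up():
        # Start new reads until the readahead window is full or input is exhausted.
        while len(pending) < readahead:
            fragment = next(remaining, None)
            if fragment is None:
                return
            pending.append(asyncio.create_task(read_fragment(fragment, batch_size)))

    top_up()
    while pending:
        batches = await pending.popleft()
        top_up()  # refill the window before handing batches to the consumer
        for batch in batches:
            yield batch


async def collect(stream):
    return [batch async for batch in stream]


fragments = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]
batches = asyncio.run(collect(scan(fragments, batch_size=2, readahead=2)))
print(batches)
```

A larger readahead window overlaps more reads at the cost of holding more data in memory, which is the throughput and memory trade-off the configuration knobs are meant to control.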
