Implementation:Lance format Lance Dataset Scan


Knowledge Sources
Domains Data_Engineering, Columnar_Storage
Last Updated 2026-02-08 19:00 GMT

Overview

Concrete tool for reading data from a Lance dataset using a builder-pattern scanner, provided by the Lance library.

Description

Dataset::scan() creates a Scanner that provides a fluent builder API for constructing read queries. The scanner supports column projection, SQL-style filtering, limit/offset pagination, k-nearest-neighbor vector search, full-text search, ordering, and batch size control. Once configured, the scanner produces results as an async stream of Arrow RecordBatches or can collect all results into a single batch.
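The fluent style described above relies on builder methods that take `&mut self` and return `&mut Self` (or `Result<&mut Self>` when the setting can be rejected). The following is a minimal, standard-library-only sketch of that pattern. `ToyScanner`, its fields, its validation rules, and `build_plan` are hypothetical stand-ins for illustration and are not the Lance `Scanner` implementation.

```rust
/// Toy stand-in for a scanner. This is NOT the Lance `Scanner` type;
/// the fields and validation rules are assumptions made for illustration.
#[derive(Debug, Default)]
pub struct ToyScanner {
    columns: Option<Vec<String>>,
    filter: Option<String>,
    batch_size: Option<usize>,
}

impl ToyScanner {
    pub fn new() -> Self {
        Self::default()
    }

    /// Fallible setter: returns `Result<&mut Self, _>` so calls chain with `?`.
    /// Rejecting an empty projection is an illustrative rule only.
    pub fn project<T: AsRef<str>>(&mut self, columns: &[T]) -> Result<&mut Self, String> {
        if columns.is_empty() {
            return Err("projection must name at least one column".to_string());
        }
        self.columns = Some(columns.iter().map(|c| c.as_ref().to_string()).collect());
        Ok(self)
    }

    /// Fallible setter: rejects a blank predicate (illustrative rule only).
    pub fn filter(&mut self, filter: &str) -> Result<&mut Self, String> {
        if filter.trim().is_empty() {
            return Err("filter must not be blank".to_string());
        }
        self.filter = Some(filter.to_string());
        Ok(self)
    }

    /// Infallible setter: returns `&mut Self` directly.
    pub fn batch_size(&mut self, batch_size: usize) -> &mut Self {
        self.batch_size = Some(batch_size);
        self
    }

    /// Summarises the configured options as text.
    pub fn describe(&self) -> String {
        format!(
            "columns={:?} filter={:?} batch_size={:?}",
            self.columns, self.filter, self.batch_size
        )
    }
}

/// Configures a toy scanner in one chain and returns its description.
pub fn build_plan() -> Result<String, String> {
    // The scanner must be bound to a `mut` variable first, because every
    // builder call borrows it mutably rather than consuming it.
    let mut scanner = ToyScanner::new();
    scanner
        .project(&["name", "age"])?
        .filter("age > 21")?
        .batch_size(1024);
    Ok(scanner.describe())
}
```

Because the setters borrow rather than consume, the configured value remains usable after the chain ends, which is why the usage examples on this page bind `dataset.scan()` to a `let mut` variable before chaining.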

Additionally, Dataset::take and Dataset::take_rows provide direct random access to specific rows, by row index and by row ID respectively. Both bypass the scan planning pipeline, which makes them suited to point lookups.

Usage

Use the scanner when:

  • Reading a subset of columns from a large dataset to reduce I/O.
  • Applying SQL predicates to filter rows before processing.
  • Running vector similarity searches over embedding columns.
  • Paginating through results with limit and offset.
  • Counting rows matching a filter without materializing data.

Code Reference

Source Location

  • Repository: Lance
  • File: rust/lance/src/dataset.rs (Dataset::scan entry point), rust/lance/src/dataset/scanner.rs (Scanner struct and builder methods)
  • Lines: dataset.rs L1370-L1372 (Dataset::scan); scanner.rs L479-L559 (Scanner struct), L870 (project), L948 (filter), L1021 (batch_size), L1124 (limit), L1147 (nearest)

Signature

// Dataset entry point
impl Dataset {
    pub fn scan(&self) -> Scanner;

    pub async fn take(
        &self,
        row_indices: &[u64],
        projection: impl Into<ProjectionRequest>,
    ) -> Result<RecordBatch>;

    pub async fn take_rows(
        &self,
        row_ids: &[u64],
        projection: impl Into<ProjectionRequest>,
    ) -> Result<RecordBatch>;

    pub async fn count_rows(&self, filter: Option<String>) -> Result<usize>;
}

// Scanner builder methods
impl Scanner {
    pub fn project<T: AsRef<str>>(&mut self, columns: &[T]) -> Result<&mut Self>;
    pub fn filter(&mut self, filter: &str) -> Result<&mut Self>;
    pub fn batch_size(&mut self, batch_size: usize) -> &mut Self;
    pub fn limit(&mut self, limit: Option<i64>, offset: Option<i64>) -> Result<&mut Self>;
    pub fn nearest(&mut self, column: &str, q: &dyn Array, k: usize) -> Result<&mut Self>;
    pub fn full_text_search(&mut self, query: FullTextSearchQuery) -> Result<&mut Self>;
    pub fn prefilter(&mut self, should_prefilter: bool) -> &mut Self;
    pub fn scan_in_order(&mut self, ordered: bool) -> &mut Self;
    pub fn with_row_id(&mut self) -> &mut Self;
    pub fn io_buffer_size(&mut self, size: u64) -> &mut Self;

    // Execution methods
    pub async fn try_into_stream(&self) -> Result<DatasetRecordBatchStream>;
    pub async fn try_into_batch(&self) -> Result<RecordBatch>;
    pub async fn count_rows(&self) -> Result<u64>;
}

Import

use lance::dataset::{Dataset, ProjectionRequest};
use lance::dataset::scanner::Scanner;

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| &self (scan) | &Dataset | Yes | The dataset to scan. |
| columns (project) | &[T: AsRef<str>] | No | Column names to project. If not set, all columns are returned. |
| filter | &str | No | SQL-style predicate string (e.g., "age > 21 AND status = 'active'"). |
| batch_size | usize | No | Maximum number of rows per output RecordBatch. |
| limit | Option<i64> | No | Maximum number of rows to return. |
| offset | Option<i64> | No | Number of rows to skip before returning results. |
| column (nearest) | &str | No | Vector column name for KNN search. |
| q (nearest) | &dyn Array | No | Query vector (Float16/32/64 Array or List thereof). |
| k (nearest) | usize | No | Number of nearest neighbors to return. |
| row_indices (take) | &[u64] | Yes (for take) | Row indices for point lookups. |
| row_ids (take_rows) | &[u64] | Yes (for take_rows) | Internal row IDs for point lookups. |
| projection (take/take_rows) | impl Into<ProjectionRequest> | Yes | Columns to return for point lookups. |
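To make the limit and offset rows concrete, here is a small standard-library sketch of the pagination semantics. It assumes that the scanner skips `offset` rows and then returns at most `limit` rows, and that negative values are rejected (which would be consistent with `limit` returning a `Result`, although this page does not state it). `apply_limit_offset` and `to_count` are hypothetical helpers operating on an in-memory slice; they are not part of Lance.

```rust
/// Converts an optional signed count into an optional `usize`,
/// returning an error for negative values (assumed behaviour).
fn to_count(value: Option<i64>, name: &str) -> Result<Option<usize>, String> {
    value
        .map(|v| {
            usize::try_from(v).map_err(|_| format!("{name} must be non-negative, got {v}"))
        })
        .transpose()
}

/// Hypothetical model of limit/offset pagination over in-memory rows:
/// skip `offset` rows (default 0), then keep at most `limit` rows
/// (default: no cap).
pub fn apply_limit_offset<T: Clone>(
    rows: &[T],
    limit: Option<i64>,
    offset: Option<i64>,
) -> Result<Vec<T>, String> {
    let skip = to_count(offset, "offset")?.unwrap_or(0);
    let take = to_count(limit, "limit")?.unwrap_or(usize::MAX);
    Ok(rows.iter().skip(skip).take(take).cloned().collect())
}
```

Under these assumptions, page `n` of size `p` corresponds to `limit = Some(p)` and `offset = Some(n * p)`, with pages numbered from zero.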

Outputs

| Name | Type | Description |
|------|------|-------------|
| Stream | Result<DatasetRecordBatchStream> | Async stream of Arrow RecordBatches via try_into_stream(). |
| Batch | Result<RecordBatch> | All matching rows collected into a single RecordBatch via try_into_batch(). |
| Count | Result<u64> | Number of matching rows via Scanner::count_rows(). Dataset::count_rows returns Result<usize> instead. |
| RecordBatch (take/take_rows) | Result<RecordBatch> | Rows at the specified indices (take) or row IDs (take_rows). |

Usage Examples

Basic Scan with Filter and Projection

use lance::dataset::Dataset;

async fn read_filtered(uri: &str) -> lance::Result<()> {
    let dataset = Dataset::open(uri).await?;
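    // Builder methods take `&mut self`, so bind the scanner before chaining.
    let mut scanner = dataset.scan();
    let stream = scanner
        .project(&["name", "age"])? // read only these two columns
        .filter("age > 21")? // SQL-style predicate
        .limit(Some(100), None)? // at most 100 rows, no offset
        .try_into_stream()
        .await?;
    // `stream` yields Arrow RecordBatches; process them here...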
    Ok(())
}

Vector Nearest-Neighbor Search

use arrow_array::Float32Array;
use lance::dataset::Dataset;

async fn vector_search(uri: &str) -> lance::Result<()> {
    let dataset = Dataset::open(uri).await?;
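    // Query vector; its length should match the dimension of the vector column.
    let query = Float32Array::from(vec![0.1, 0.2, 0.3, 0.4]);
    let mut scanner = dataset.scan();
    let results = scanner
        .nearest("embedding", &query, 10)? // 10 nearest neighbors on `embedding`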
        .try_into_batch()
        .await?;
    println!("Found {} neighbors", results.num_rows());
    Ok(())
}
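For readers unfamiliar with k-nearest-neighbor search, the sketch below shows what such a query computes conceptually: score every stored vector against the query and keep the `k` closest. It is a brute-force, standard-library model that assumes squared L2 distance. `knn_l2` is a hypothetical helper and does not reflect how Lance plans or executes `nearest`, which may use a vector index or a different metric.

```rust
/// Brute-force k-nearest-neighbor search using squared L2 distance.
/// Returns `(row_index, squared_distance)` pairs, closest first.
/// Vectors whose length differs from the query are skipped.
pub fn knn_l2(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .filter(|(_, v)| v.len() == query.len())
        .map(|(i, v)| {
            // Sum of squared component differences.
            let dist: f32 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
            (i, dist)
        })
        .collect();
    // `total_cmp` gives a total order over floats, so NaN cannot panic the sort.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

This exhaustive approach costs time proportional to the number of rows times the vector dimension, so it is only a reference point for understanding the result shape of a `k`-limited search.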

Point Lookup by Row Index

use lance::dataset::{Dataset, ProjectionRequest};

async fn point_lookup(dataset: &Dataset) -> lance::Result<()> {
    // Fetch three rows by index; passing the full schema requests every column.
    let rows = dataset.take(&[0, 42, 100], dataset.schema().clone()).await?;
    println!("Fetched {} rows", rows.num_rows());
    Ok(())
}

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
