Implementation:Lance format Lance Dataset Scan
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Columnar_Storage |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Concrete tool for reading data from a Lance dataset using a builder-pattern scanner, provided by the Lance library.
Description
Dataset::scan() creates a Scanner that provides a fluent builder API for constructing read queries. The scanner supports column projection, SQL-style filtering, limit/offset pagination, k-nearest-neighbor vector search, full-text search, ordering, and batch size control. Once configured, the scanner produces results as an async stream of Arrow RecordBatches or can collect all results into a single batch.
Additionally, Dataset::take and Dataset::take_rows provide direct random access to specific rows by index or row ID respectively, bypassing the scan planning pipeline for point lookups.
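The builder methods listed on this page take `&mut self` and return `&mut Self` (wrapped in `Result` when validation can fail), which is why the scanner has to be bound to a mutable variable before the calls are chained. The following is a minimal, standard-library-only sketch of that pattern. `MiniScanner`, its fields, the `configure` helper, and the validation rules are hypothetical stand-ins for illustration, not Lance types or Lance behavior.

```rust
// Hypothetical stand-in for a scanner; NOT the Lance `Scanner` type.
#[derive(Debug, Default)]
struct MiniScanner {
    columns: Option<Vec<String>>,
    limit: Option<i64>,
    offset: Option<i64>,
}

impl MiniScanner {
    // Fallible builder step: returns `&mut Self` so calls can be chained with `?`.
    fn project<T: AsRef<str>>(&mut self, columns: &[T]) -> Result<&mut Self, String> {
        // Illustrative check only: reject empty column names.
        if columns.iter().any(|c| c.as_ref().is_empty()) {
            return Err("column names must not be empty".to_string());
        }
        self.columns = Some(columns.iter().map(|c| c.as_ref().to_string()).collect());
        Ok(self)
    }

    // Fallible builder step mirroring the `limit(limit, offset)` shape.
    fn limit(&mut self, limit: Option<i64>, offset: Option<i64>) -> Result<&mut Self, String> {
        // Illustrative check only: reject negative values.
        if limit.map_or(false, |l| l < 0) || offset.map_or(false, |o| o < 0) {
            return Err("limit and offset must be non-negative".to_string());
        }
        self.limit = limit;
        self.offset = offset;
        Ok(self)
    }
}

// The builder is bound to a `mut` variable first; the chain only borrows it.
fn configure() -> Result<MiniScanner, String> {
    let mut scanner = MiniScanner::default();
    scanner.project(&["name", "age"])?.limit(Some(100), None)?;
    Ok(scanner)
}
```

Because each step borrows the builder instead of consuming it, the configured value stays available after the chain ends, so several execution methods can be called on the same configuration.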
Usage
Use the scanner when:
- Reading a subset of columns from a large dataset to reduce I/O.
- Applying SQL predicates to filter rows before processing.
- Running vector similarity searches over embedding columns.
- Paginating through results with limit and offset.
- Counting rows matching a filter without materializing data.
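For the pagination case above, one way to derive the `(limit, offset)` arguments for successive pages is sketched below using only the standard library. `page_windows` is a hypothetical helper, not part of Lance. It assumes the total row count is known up front (for example from a prior count) and that the scan order is stable between pages.

```rust
/// Hypothetical helper: returns `(limit, offset)` pairs that cover
/// `total_rows` rows in pages of at most `page_size` rows each.
fn page_windows(total_rows: i64, page_size: i64) -> Vec<(Option<i64>, Option<i64>)> {
    assert!(page_size > 0, "page_size must be positive");
    let mut windows = Vec::new();
    let mut offset = 0;
    while offset < total_rows {
        // The last page may be shorter than `page_size`.
        let limit = page_size.min(total_rows - offset);
        windows.push((Some(limit), Some(offset)));
        offset += page_size;
    }
    windows
}
```

Each pair has the same shape as the two arguments of the `limit(limit, offset)` builder method shown in the signature section, so a caller could pass one pair per scan to fetch one page at a time.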
Code Reference
Source Location
- Repository: Lance
- File: rust/lance/src/dataset.rs (scan entry point L1370-L1372), rust/lance/src/dataset/scanner.rs (Scanner struct L479-L559)
- Lines: L1370-L1372 (Dataset::scan), L870 (project), L948 (filter), L1021 (batch_size), L1124 (limit), L1147 (nearest)
Signature
// Dataset entry point
impl Dataset {
    pub fn scan(&self) -> Scanner;

    pub async fn take(
        &self,
        row_indices: &[u64],
        projection: impl Into<ProjectionRequest>,
    ) -> Result<RecordBatch>;

    pub async fn take_rows(
        &self,
        row_ids: &[u64],
        projection: impl Into<ProjectionRequest>,
    ) -> Result<RecordBatch>;

    pub async fn count_rows(&self, filter: Option<String>) -> Result<usize>;
}

// Scanner builder methods
impl Scanner {
    pub fn project<T: AsRef<str>>(&mut self, columns: &[T]) -> Result<&mut Self>;
    pub fn filter(&mut self, filter: &str) -> Result<&mut Self>;
    pub fn batch_size(&mut self, batch_size: usize) -> &mut Self;
    pub fn limit(&mut self, limit: Option<i64>, offset: Option<i64>) -> Result<&mut Self>;
    pub fn nearest(&mut self, column: &str, q: &dyn Array, k: usize) -> Result<&mut Self>;
    pub fn full_text_search(&mut self, query: FullTextSearchQuery) -> Result<&mut Self>;
    pub fn prefilter(&mut self, should_prefilter: bool) -> &mut Self;
    pub fn scan_in_order(&mut self, ordered: bool) -> &mut Self;
    pub fn with_row_id(&mut self) -> &mut Self;
    pub fn io_buffer_size(&mut self, size: u64) -> &mut Self;

    // Execution methods
    pub async fn try_into_stream(&self) -> Result<DatasetRecordBatchStream>;
    pub async fn try_into_batch(&self) -> Result<RecordBatch>;
    pub async fn count_rows(&self) -> Result<u64>;
}
Import
use lance::dataset::{Dataset, ProjectionRequest};
use lance::dataset::scanner::Scanner;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| &self (scan) | &Dataset | Yes | The dataset to scan. |
| columns (project) | &[T: AsRef<str>] | No | Column names to project. If not set, all columns are returned. |
| filter | &str | No | SQL-style predicate string (e.g., "age > 21 AND status = 'active'"). |
| batch_size | usize | No | Maximum number of rows per output RecordBatch. |
| limit | Option<i64> | No | Maximum number of rows to return. |
| offset | Option<i64> | No | Number of rows to skip before returning results. |
| column (nearest) | &str | No | Vector column name for KNN search. |
| q (nearest) | &dyn Array | No | Query vector (Float16/32/64 Array or List thereof). |
| k (nearest) | usize | No | Number of nearest neighbors to return. |
| row_indices (take) | &[u64] | Yes (for take) | Row indices for point lookups. |
| row_ids (take_rows) | &[u64] | Yes (for take_rows) | Internal row IDs for point lookups. |
| projection (take/take_rows) | impl Into<ProjectionRequest> | Yes | Columns to return for point lookups. |
Outputs
| Name | Type | Description |
|---|---|---|
| Stream | Result<DatasetRecordBatchStream> | Async stream of Arrow RecordBatches via try_into_stream(). |
| Batch | Result<RecordBatch> | All matching rows collected into a single RecordBatch via try_into_batch(). |
| Count | Result<u64> | Number of matching rows via count_rows(). |
| RecordBatch (take) | Result<RecordBatch> | Rows at the specified indices. |
Usage Examples
Basic Scan with Filter and Projection
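use lance::dataset::Dataset;

async fn read_filtered(uri: &str) -> lance::Result<()> {
    let dataset = Dataset::open(uri).await?;
    // The builder methods take `&mut self`, so bind the scanner mutably first.
    let mut scanner = dataset.scan();
    // `try_into_stream` yields an async stream of RecordBatches, not a Vec.
    let stream = scanner
        .project(&["name", "age"])?
        .filter("age > 21")?
        .limit(Some(100), None)?
        .try_into_stream()
        .await?;
    // Process the RecordBatches yielded by `stream`...
    Ok(())
}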
Vector Nearest-Neighbor Search
use arrow_array::Float32Array;
use lance::dataset::Dataset;

async fn vector_search(uri: &str) -> lance::Result<()> {
    let dataset = Dataset::open(uri).await?;
    let query = Float32Array::from(vec![0.1, 0.2, 0.3, 0.4]);
    let mut scanner = dataset.scan();
    let results = scanner
        .nearest("embedding", &query, 10)?
        .try_into_batch()
        .await?;
    println!("Found {} neighbors", results.num_rows());
    Ok(())
}
Point Lookup by Row Index
use lance::dataset::Dataset;

async fn point_lookup(dataset: &Dataset) -> lance::Result<()> {
    // Passing the full schema as the projection returns every column.
    let rows = dataset.take(&[0, 42, 100], dataset.schema().clone()).await?;
    println!("Fetched {} rows", rows.num_rows());
    Ok(())
}
Related Pages
Implements Principle
Requires Environment
- Environment:Lance_format_Lance_Rust_Toolchain
- Environment:Lance_format_Lance_Python_Environment
- Environment:Lance_format_Lance_Cloud_Storage_Credentials