Implementation:Lance format Lance LanceTableProvider

From Leeroopedia


Knowledge Sources
Domains DataFusion, Infrastructure
Last Updated 2026-02-08 19:33 GMT

Overview

Description

The LanceTableProvider module provides two DataFusion TableProvider implementations for Lance datasets:

1. Direct impl TableProvider for Dataset (in logical_plan.rs): A straightforward implementation that exposes a Lance Dataset directly as a DataFusion table. It supports projection pushdown and limit pushdown but not filter pushdown, so DataFusion evaluates any WHERE predicates itself on top of the scan. This is useful for simple SQL queries over Lance datasets.

2. LanceTableProvider struct (in dataframe.rs): A more full-featured table provider that additionally supports:

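  • Filter pushdown -- Every filter is reported as exact (TableProviderFilterPushDown::Exact in DataFusion), so the scan is responsible for applying it and DataFusion does not re-apply it afterwards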
  • System columns -- Optional inclusion of _rowid and _rowaddr columns in the schema
  • Ordered/unordered scans -- Configurable scan ordering
  • Limit pushdown -- Passed through to the Lance scanner
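The difference between the two providers' filter handling can be illustrated with a small std-only model. This is a sketch of the pushdown contract, not Lance or DataFusion code: the PushDown enum mirrors the three variants of DataFusion's TableProviderFilterPushDown, while residual_filters, direct_dataset_support, and lance_provider_support are hypothetical helper names, and filters are represented as plain strings.

```rust
/// Mirrors the three answers a table provider can give for each filter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PushDown {
    /// The provider ignores the filter; the engine must apply it.
    Unsupported,
    /// The provider may apply it partially; the engine re-applies it.
    Inexact,
    /// The provider guarantees the filter is fully applied.
    Exact,
}

/// Returns the filters the engine still has to evaluate above the scan.
pub fn residual_filters<'a>(filters: &[&'a str], support: &[PushDown]) -> Vec<&'a str> {
    filters
        .iter()
        .zip(support)
        .filter(|(_, s)| **s != PushDown::Exact)
        .map(|(f, _)| *f)
        .collect()
}

/// Hypothetical model of the direct `impl TableProvider for Dataset`:
/// no filter pushdown at all.
pub fn direct_dataset_support(n: usize) -> Vec<PushDown> {
    vec![PushDown::Unsupported; n]
}

/// Hypothetical model of `LanceTableProvider`: every filter is exact.
pub fn lance_provider_support(n: usize) -> Vec<PushDown> {
    vec![PushDown::Exact; n]
}
```

With two filters, the LanceTableProvider model leaves nothing for the engine to re-check, while the direct Dataset model leaves both filters to be evaluated after the scan.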

The module also provides the SessionContextExt trait that adds convenience methods to DataFusion's SessionContext:

  • read_lance -- Creates a DataFrame for an ordered Lance dataset scan
  • read_lance_unordered -- Creates a DataFrame for an unordered scan
  • read_one_shot -- Creates a DataFrame from a SendableRecordBatchStream

OneShotPartitionStream is a helper that wraps a SendableRecordBatchStream as a DataFusion PartitionStream. Because the underlying stream cannot be replayed, the resulting DataFrame can be executed only once.
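The consume-once behaviour can be modelled without any DataFusion types. The following is a minimal std-only sketch, assuming the wrapper holds its payload in a Mutex<Option<T>> and hands it out on the first request. The OneShot type and its take method are hypothetical names for illustration; how the real OneShotPartitionStream reacts to a second execution (error or panic) is not modelled here.

```rust
use std::sync::Mutex;

/// A std-only model of a wrapper whose payload can be taken exactly once.
pub struct OneShot<T> {
    inner: Mutex<Option<T>>,
}

impl<T> OneShot<T> {
    /// Wraps a value (standing in for a record batch stream).
    pub fn new(value: T) -> Self {
        Self {
            inner: Mutex::new(Some(value)),
        }
    }

    /// Hands out the payload on the first call and returns an error on
    /// every later call, because the slot is empty by then.
    pub fn take(&self) -> Result<T, String> {
        self.inner
            .lock()
            .map_err(|_| "lock poisoned".to_string())?
            .take()
            .ok_or_else(|| "stream already consumed".to_string())
    }
}
```

Taking through a shared reference matters because an execution-style API typically only has &self available; the Mutex provides the interior mutability needed to empty the slot.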

Usage

These providers let you run SQL queries and use DataFusion's query planner over Lance datasets. Register a provider with a DataFusion SessionContext to make the Lance table accessible via SQL.

Code Reference

Source Location

  • rust/lance/src/datafusion/logical_plan.rs -- Direct Dataset as TableProvider
  • rust/lance/src/datafusion/dataframe.rs -- LanceTableProvider struct and SessionContextExt

Signature

// logical_plan.rs
#[async_trait]
impl TableProvider for Dataset { /* ... */ }

// dataframe.rs
#[derive(Debug)]
pub struct LanceTableProvider {
    dataset: Arc<Dataset>,
    full_schema: Arc<Schema>,
    row_id_idx: Option<usize>,
    row_addr_idx: Option<usize>,
    ordered: bool,
}

impl LanceTableProvider {
    pub fn new(dataset: Arc<Dataset>, with_row_id: bool, with_row_addr: bool) -> Self;
    pub fn new_with_ordering(
        dataset: Arc<Dataset>, with_row_id: bool, with_row_addr: bool, ordered: bool,
    ) -> Self;
    pub fn dataset(&self) -> Arc<Dataset>;
}

pub trait SessionContextExt {
    fn read_lance(&self, dataset: Arc<Dataset>, with_row_id: bool, with_row_addr: bool)
        -> datafusion::common::Result<DataFrame>;
    fn read_lance_unordered(&self, dataset: Arc<Dataset>, with_row_id: bool, with_row_addr: bool)
        -> datafusion::common::Result<DataFrame>;
    fn read_one_shot(&self, data: SendableRecordBatchStream)
        -> datafusion::common::Result<DataFrame>;
}

Import

use lance::datafusion::{LanceTableProvider, SessionContextExt};

I/O Contract

Inputs

Parameter      Type                 Description
dataset        Arc<Dataset>         The Lance dataset to expose as a DataFusion table
with_row_id    bool                 Whether to include the _rowid system column
with_row_addr  bool                 Whether to include the _rowaddr system column
ordered        bool                 Whether to return results in deterministic order (default: true)
projection     Option<&Vec<usize>>  Column indices to project (from DataFusion)
filters        &[Expr]              DataFusion filter expressions for pushdown
limit          Option<usize>        Maximum number of rows to return
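To make the interaction between with_row_id, with_row_addr, and projection concrete, here is a std-only sketch of how a provider might track system columns. It assumes the system columns are appended after the data columns, _rowid first, and that their positions are remembered in row_id_idx and row_addr_idx as in the struct shown under Signature. SchemaSketch, build_full_schema, and split_projection are hypothetical names, and the append order is an assumption rather than something this page confirms.

```rust
/// Simplified stand-in for a table schema: an ordered list of column names
/// plus the positions of the optional system columns.
#[derive(Debug, Clone, PartialEq)]
pub struct SchemaSketch {
    pub columns: Vec<String>,
    pub row_id_idx: Option<usize>,
    pub row_addr_idx: Option<usize>,
}

/// Builds the full schema, appending system columns when requested
/// (assumed order: `_rowid`, then `_rowaddr`).
pub fn build_full_schema(base: &[&str], with_row_id: bool, with_row_addr: bool) -> SchemaSketch {
    let mut columns: Vec<String> = base.iter().map(|c| c.to_string()).collect();
    let mut row_id_idx = None;
    let mut row_addr_idx = None;
    if with_row_id {
        columns.push("_rowid".to_string());
        row_id_idx = Some(columns.len() - 1);
    }
    if with_row_addr {
        columns.push("_rowaddr".to_string());
        row_addr_idx = Some(columns.len() - 1);
    }
    SchemaSketch { columns, row_id_idx, row_addr_idx }
}

/// Splits a projection (column indices into the full schema) into the data
/// columns to read and two flags saying whether each system column is wanted.
pub fn split_projection(schema: &SchemaSketch, projection: &[usize]) -> (Vec<String>, bool, bool) {
    let mut data_cols = Vec::new();
    let mut want_row_id = false;
    let mut want_row_addr = false;
    for &idx in projection {
        if Some(idx) == schema.row_id_idx {
            want_row_id = true;
        } else if Some(idx) == schema.row_addr_idx {
            want_row_addr = true;
        } else {
            data_cols.push(schema.columns[idx].clone());
        }
    }
    (data_cols, want_row_id, want_row_addr)
}
```

Keeping the indices lets a scan translate the integer projection it receives into "read these data columns" plus "also emit the row id or row address", since system columns are not stored as ordinary data columns.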

Outputs

Type                    Description
Arc<dyn ExecutionPlan>  A DataFusion execution plan that scans the Lance dataset
SchemaRef               The schema of the table (including any requested system columns)
DataFrame               A DataFusion DataFrame for further query composition (via SessionContextExt)

Usage Examples

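use std::sync::Arc;

use datafusion::prelude::SessionContext;
use lance::datafusion::{LanceTableProvider, SessionContextExt};
use lance::Dataset;

// Assumes a Tokio runtime is available (the tokio crate with its macros enabled).
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the dataset once and share it through an Arc.
    let dataset = Arc::new(Dataset::open("/path/to/data.lance").await?);

    // Register the dataset as a DataFusion table, exposing _rowid but not _rowaddr.
    let ctx = SessionContext::new();
    ctx.register_table(
        "my_table",
        Arc::new(LanceTableProvider::new(dataset.clone(), true, false)),
    )?;

    // Query with SQL (assumes the dataset has an `id` column).
    let df = ctx.sql("SELECT * FROM my_table WHERE id > 100 LIMIT 10").await?;
    let results = df.collect().await?;
    println!("collected {} record batches", results.len());

    // Or use the convenience extension: an ordered scan without system columns.
    let df = ctx.read_lance(dataset, false, false)?;
    df.show().await?;
    Ok(())
}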
