Implementation:Lance format Lance Dataset Write

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Columnar_Storage
Last Updated 2026-02-08 19:00 GMT

Overview

Concrete tool for creating a new Lance dataset from a stream of Arrow RecordBatches, provided by the Lance library.

Description

Dataset::write is the primary entry point for creating or writing to a Lance dataset. It accepts an Arrow RecordBatchReader as the data source, a destination (a URI string, an object_store::path::Path, or an Arc<Dataset>), and optional WriteParams. Internally, it constructs an InsertBuilder, applies the provided parameters, and runs the write-and-commit pipeline: data fragments are written to storage first, and a manifest is then committed atomically. On success, the method returns the resulting Dataset.

The WriteParams struct controls all physical and logical aspects of the write, including file size limits, write mode, storage configuration, commit handling, row ID stability, manifest path format, and automatic version cleanup.
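To make the file size limits concrete, the following is a minimal sketch of how max_rows_per_file bounds the number of data files. It is a simplified model, not Lance's implementation: it assumes only the row limit applies and ignores max_bytes_per_file, which may cause a file to be closed earlier. The function names are hypothetical.

```rust
/// Simplified model: number of data files when only the row limit applies.
/// Assumption: max_bytes_per_file is never reached before the row limit.
pub fn estimated_file_count(total_rows: usize, max_rows_per_file: usize) -> usize {
    assert!(max_rows_per_file > 0, "max_rows_per_file must be positive");
    // Ceiling division without relying on newer standard-library helpers.
    (total_rows + max_rows_per_file - 1) / max_rows_per_file
}

/// Rows that land in the final file under the same simplified model.
pub fn rows_in_last_file(total_rows: usize, max_rows_per_file: usize) -> usize {
    assert!(max_rows_per_file > 0, "max_rows_per_file must be positive");
    if total_rows == 0 {
        return 0;
    }
    let remainder = total_rows % max_rows_per_file;
    if remainder == 0 {
        max_rows_per_file
    } else {
        remainder
    }
}
```

Under this model, 2,500,000 rows with the default limit of 1,048,576 rows per file would yield three files, the last holding 402,848 rows.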

WriteDestination is an enum with two variants: Dataset, which wraps an existing Arc<Dataset> (for appending or overwriting), and Uri, which borrows a URI string naming the dataset location (for example, when creating a new dataset at a path).
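The reason a plain string and an Arc<Dataset> can both be passed as the destination is the impl Into<...> bound on the parameter. The sketch below illustrates that pattern with stand-in types. MockDataset, MockDestination, and describe are hypothetical names invented for this illustration and are not part of Lance; only the two-variant shape is taken from the enum described above.

```rust
use std::sync::Arc;

// Hypothetical stand-in for lance's Dataset, so the sketch compiles on its own.
pub struct MockDataset {
    pub uri: String,
}

// Mirrors the two-variant shape of WriteDestination; not the real type.
pub enum MockDestination<'a> {
    Dataset(Arc<MockDataset>),
    Uri(&'a str),
}

impl<'a> From<&'a str> for MockDestination<'a> {
    fn from(uri: &'a str) -> Self {
        MockDestination::Uri(uri)
    }
}

impl<'a> From<Arc<MockDataset>> for MockDestination<'a> {
    fn from(dataset: Arc<MockDataset>) -> Self {
        MockDestination::Dataset(dataset)
    }
}

// Accepts anything convertible into a destination, like the `dest` parameter.
pub fn describe<'a>(dest: impl Into<MockDestination<'a>>) -> String {
    match dest.into() {
        MockDestination::Uri(uri) => format!("dataset at URI {}", uri),
        MockDestination::Dataset(ds) => format!("existing dataset handle for {}", ds.uri),
    }
}
```

With these stand-ins, describe("data/my_dataset") takes the Uri branch, while passing an Arc<MockDataset> takes the Dataset branch, without the caller naming the enum.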

Usage

Use Dataset::write when:

  • Creating a new Lance dataset from Arrow data.
  • Overwriting an existing dataset with fresh data.
  • Programmatically controlling write parameters such as row group size, file size, or storage backend.

Code Reference

Source Location

  • Repository: Lance
  • File: rust/lance/src/dataset.rs
  • Lines: L749-L760

Signature

pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    dest: impl Into<WriteDestination<'_>>,
    params: Option<WriteParams>,
) -> Result<Self>

Supporting Types

// rust/lance/src/dataset/write.rs:L60-L65
pub enum WriteDestination<'a> {
    /// An existing dataset to write to.
    Dataset(Arc<Dataset>),
    /// A URI to write to.
    Uri(&'a str),
}

// rust/lance/src/dataset/write.rs:L108-L116
pub enum WriteMode {
    Create,
    Append,
    Overwrite,
}

// rust/lance/src/dataset/write.rs:L152-L248
pub struct WriteParams {
    pub max_rows_per_file: usize,        // default: 1,048,576
    pub max_rows_per_group: usize,       // default: 1,024
    pub max_bytes_per_file: usize,       // default: 90 GB
    pub mode: WriteMode,                 // default: WriteMode::Create
    pub store_params: Option<ObjectStoreParams>,
    pub progress: Arc<dyn WriteFragmentProgress>,
    pub commit_handler: Option<Arc<dyn CommitHandler>>,
    pub data_storage_version: Option<LanceFileVersion>,
    pub enable_stable_row_ids: bool,
    pub enable_v2_manifest_paths: bool,  // default: true
    pub session: Option<Arc<Session>>,
    pub auto_cleanup: Option<AutoCleanupParams>,
    pub skip_auto_cleanup: bool,
    pub transaction_properties: Option<Arc<HashMap<String, String>>>,
    pub initial_bases: Option<Vec<BasePath>>,
    pub target_bases: Option<Vec<u32>>,
    pub target_base_names_or_paths: Option<Vec<String>>,
}

Import

use lance::dataset::{Dataset, WriteParams, WriteMode};

I/O Contract

Inputs

Name | Type | Required | Description
batches | impl RecordBatchReader + Send + 'static | Yes | Stream of Arrow RecordBatches containing the data to write. The schema is taken from the reader.
dest | impl Into<WriteDestination<'_>> | Yes | Destination path (URI string, object_store Path) or existing dataset (Arc<Dataset>).
params | Option<WriteParams> | No | Optional write parameters controlling file layout, write mode, storage backend, commit handler, and more. When None, the defaults apply: WriteMode::Create, 1,048,576 rows per file, and 1,024 rows per group.

Outputs

Name | Type | Description
Result | Result<Dataset> | The newly created or updated Dataset on success, or an error if creation fails (e.g., dataset already exists in Create mode).
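The interaction between write mode and the returned result can be sketched with a toy in-memory model. This is an illustration only, not Lance's commit logic. MockStore, Mode, and WriteError are hypothetical names, and two behaviors are assumptions of the sketch rather than statements from this page: versions start at 1, and Append or Overwrite on a missing dataset behaves like Create.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode {
    Create,
    Append,
    Overwrite,
}

#[derive(Debug, PartialEq)]
pub enum WriteError {
    AlreadyExists(String),
}

/// Toy store mapping a URI to (version, row count).
#[derive(Default)]
pub struct MockStore {
    datasets: HashMap<String, (u64, usize)>,
}

impl MockStore {
    /// Returns the (version, row count) after the write, or an error.
    pub fn write(&mut self, uri: &str, rows: usize, mode: Mode) -> Result<(u64, usize), WriteError> {
        let existing = self.datasets.get(uri).copied();
        let next = match (mode, existing) {
            // Create refuses to touch a dataset that is already there.
            (Mode::Create, Some(_)) => return Err(WriteError::AlreadyExists(uri.to_string())),
            // Append keeps the old rows and adds the new ones in a new version.
            (Mode::Append, Some((version, count))) => (version + 1, count + rows),
            // Overwrite replaces the contents in a new version.
            (Mode::Overwrite, Some((version, _))) => (version + 1, rows),
            // Assumption: any mode on a missing dataset starts at version 1.
            (_, None) => (1, rows),
        };
        self.datasets.insert(uri.to_string(), next);
        Ok(next)
    }
}
```

In this model, a second Create against the same URI returns AlreadyExists, while Append and Overwrite each produce a new version.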

Usage Examples

Basic Example

use std::sync::Arc;
use arrow_array::{Int64Array, RecordBatch, RecordBatchIterator};
use arrow_schema::{DataType, Field, Schema};
use lance::dataset::Dataset;

async fn create_dataset() -> lance::Result<()> {
    // A single non-nullable Int64 column named "id".
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
    ]));
    let array = Arc::new(Int64Array::from(vec![1, 2, 3, 4, 5]));
    // The column matches the schema, so try_new is not expected to fail here.
    let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
    // Wrap the batch in a RecordBatchReader, which Dataset::write consumes.
    let reader = RecordBatchIterator::new(
        vec![batch].into_iter().map(Ok),
        schema,
    );

    // Passing None uses the default WriteParams (WriteMode::Create).
    let dataset = Dataset::write(reader, "data/my_dataset", None).await?;
    println!("Created dataset v{}", dataset.version().version);
    Ok(())
}

With Custom Parameters

use lance::dataset::{Dataset, WriteMode, WriteParams};

async fn create_with_params(reader: impl arrow_array::RecordBatchReader + Send + 'static) -> lance::Result<()> {
    let params = WriteParams {
        max_rows_per_file: 500_000,
        max_rows_per_group: 2048,
        mode: WriteMode::Create,         // errors if the dataset already exists
        enable_v2_manifest_paths: true,  // already the default; shown for clarity
        enable_stable_row_ids: true,
        // Every other field keeps its default value.
        ..Default::default()
    };

    let dataset = Dataset::write(reader, "s3://bucket/my_dataset", Some(params)).await?;
    println!("Created dataset v{}", dataset.version().version);
    Ok(())
}
