Implementation:Lance format Lance Dataset Write

Knowledge Sources	Lance Lance Docs
Domains	Data_Engineering, Columnar_Storage
Last Updated	2026-02-08 19:00 GMT

Overview

Concrete tool for creating a new Lance dataset from a stream of Arrow RecordBatches, provided by the Lance library.

Description

Dataset::write is the primary entry point for creating or writing to a Lance dataset. It accepts an Arrow RecordBatchReader as the data source, a destination (URI string, object_store::path::Path, or Arc<Dataset>), and optional WriteParams. Internally, it constructs an InsertBuilder, applies the provided parameters, and executes the write-and-commit pipeline: data fragments are written to storage first, then a manifest is committed atomically. On success the method returns the resulting Dataset.

The WriteParams struct controls all physical and logical aspects of the write, including file size limits, write mode, storage configuration, commit handling, row ID stability, manifest path format, and automatic version cleanup.

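WriteDestination is an enum with two variants: Dataset(Arc<Dataset>), for writing to an already-open dataset (appending or overwriting), and Uri(&str), for writing to a location identified by a URI. Because the dest parameter is declared as impl Into<WriteDestination<'_>>, callers pass a supported value directly instead of constructing the enum by hand.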
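The conversion pattern behind the dest parameter can be sketched with the standard library alone. This is a minimal illustration, not Lance's source: the Dataset struct below is a hypothetical placeholder, the enum mirrors the two variants listed on this page, and describe is an invented helper that stands in for any function taking impl Into<WriteDestination<'_>>.

```rust
use std::sync::Arc;

/// Hypothetical placeholder for lance's `Dataset`; for illustration only.
#[derive(Debug)]
pub struct Dataset {
    pub uri: String,
}

/// Mirrors the two variants shown on this page.
#[derive(Debug)]
pub enum WriteDestination<'a> {
    /// An existing dataset to write to.
    Dataset(Arc<Dataset>),
    /// A URI to write to.
    Uri(&'a str),
}

// A borrowed string converts into the `Uri` variant.
impl<'a> From<&'a str> for WriteDestination<'a> {
    fn from(uri: &'a str) -> Self {
        WriteDestination::Uri(uri)
    }
}

// A shared dataset handle converts into the `Dataset` variant.
impl From<Arc<Dataset>> for WriteDestination<'_> {
    fn from(dataset: Arc<Dataset>) -> Self {
        WriteDestination::Dataset(dataset)
    }
}

/// Invented helper: accepts anything convertible into a destination,
/// the same shape as the `dest` parameter of `write`.
pub fn describe<'a>(dest: impl Into<WriteDestination<'a>>) -> String {
    match dest.into() {
        WriteDestination::Dataset(ds) => format!("existing dataset at {}", ds.uri),
        WriteDestination::Uri(uri) => format!("uri {}", uri),
    }
}
```

With these impls, describe("data/my_dataset") and describe(Arc::new(Dataset { .. })) both compile without the caller naming a variant, which is the ergonomic reason for taking impl Into rather than the enum itself.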

Usage

Use Dataset::write when:

Creating a new Lance dataset from Arrow data.
Overwriting an existing dataset with fresh data.
Programmatically controlling write parameters such as row group size, file size, or storage backend.

Code Reference

Source Location

Repository: Lance
File: rust/lance/src/dataset.rs
Lines: L749-L760

Signature

pub async fn write(
    batches: impl RecordBatchReader + Send + 'static,
    dest: impl Into<WriteDestination<'_>>,
    params: Option<WriteParams>,
) -> Result<Self>

Supporting Types

// rust/lance/src/dataset/write.rs:L60-L65
pub enum WriteDestination<'a> {
    /// An existing dataset to write to.
    Dataset(Arc<Dataset>),
    /// A URI to write to.
    Uri(&'a str),
}

// rust/lance/src/dataset/write.rs:L108-L116
pub enum WriteMode {
    Create,
    Append,
    Overwrite,
}

// rust/lance/src/dataset/write.rs:L152-L248
pub struct WriteParams {
    pub max_rows_per_file: usize,        // default: 1,048,576
    pub max_rows_per_group: usize,       // default: 1,024
    pub max_bytes_per_file: usize,       // default: 90 GB
    pub mode: WriteMode,                 // default: WriteMode::Create
    pub store_params: Option<ObjectStoreParams>,
    pub progress: Arc<dyn WriteFragmentProgress>,
    pub commit_handler: Option<Arc<dyn CommitHandler>>,
    pub data_storage_version: Option<LanceFileVersion>,
    pub enable_stable_row_ids: bool,
    pub enable_v2_manifest_paths: bool,  // default: true
    pub session: Option<Arc<Session>>,
    pub auto_cleanup: Option<AutoCleanupParams>,
    pub skip_auto_cleanup: bool,
    pub transaction_properties: Option<Arc<HashMap<String, String>>>,
    pub initial_bases: Option<Vec<BasePath>>,
    pub target_bases: Option<Vec<u32>>,
    pub target_base_names_or_paths: Option<Vec<String>>,
}
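
To get a feel for what max_rows_per_file implies, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate under a stated assumption: only the row limit binds, so max_bytes_per_file and any other splitting the writer may do are ignored. The function names are invented for this sketch and are not part of Lance.

```rust
/// Default quoted in the WriteParams listing on this page.
pub const DEFAULT_MAX_ROWS_PER_FILE: usize = 1024 * 1024; // 1,048,576

/// Ceiling division; yields 0 when `total` is 0.
fn ceil_div(total: usize, chunk: usize) -> usize {
    assert!(chunk > 0, "chunk size must be positive");
    (total + chunk - 1) / chunk
}

/// Estimated number of data files for `total_rows`.
/// Assumption: the row limit is the only limit that applies.
pub fn estimated_files(total_rows: usize, max_rows_per_file: usize) -> usize {
    ceil_div(total_rows, max_rows_per_file)
}

/// Rows that end up in the final (possibly partial) file under the same assumption.
pub fn rows_in_last_file(total_rows: usize, max_rows_per_file: usize) -> usize {
    assert!(max_rows_per_file > 0, "max_rows_per_file must be positive");
    if total_rows == 0 {
        return 0;
    }
    match total_rows % max_rows_per_file {
        0 => max_rows_per_file, // the last file is exactly full
        rem => rem,
    }
}
```

For example, 1,200,000 rows with max_rows_per_file set to 500,000 gives an estimate of three files, the last holding 200,000 rows.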

Import

use lance::dataset::{Dataset, WriteParams, WriteMode};

I/O Contract

Inputs

Name	Type	Required	Description
batches	`impl RecordBatchReader + Send + 'static`	Yes	Stream of Arrow RecordBatches containing the data to write. The dataset schema is taken from the reader's schema.
dest	`impl Into<WriteDestination<'_>>`	Yes	Where to write: a destination path (URI string or object_store Path) or an existing dataset (`Arc<Dataset>`).
params	`Option<WriteParams>`	No	Write parameters controlling file layout, write mode, storage backend, commit handler, and more. When `None`, the defaults apply: `WriteMode::Create`, 1,048,576 rows per file, and 1,024 rows per group.

Outputs

Name	Type	Description
Result	`Result<Dataset>`	The newly created or updated Dataset on success, or an error if creation fails (e.g., dataset already exists in Create mode).
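
The failure case named above (Create mode against an existing dataset) can be illustrated with a toy in-memory model. This is a sketch of the semantics described on this page, not Lance's implementation: ToyStore, WriteError, and the row-vector storage are invented for illustration, and the behavior of Append on a missing target is an assumption of this sketch rather than documented Lance behavior.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WriteMode {
    Create,
    Append,
    Overwrite,
}

/// Invented error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum WriteError {
    AlreadyExists(String),
}

/// Toy stand-in for storage: maps a URI to the rows stored there.
#[derive(Default)]
pub struct ToyStore {
    datasets: HashMap<String, Vec<i64>>,
}

impl ToyStore {
    /// Applies `mode` and returns the row count at `uri` after the write.
    pub fn write(
        &mut self,
        uri: &str,
        rows: Vec<i64>,
        mode: WriteMode,
    ) -> Result<usize, WriteError> {
        match mode {
            WriteMode::Create => {
                // Create refuses to touch an existing dataset.
                if self.datasets.contains_key(uri) {
                    return Err(WriteError::AlreadyExists(uri.to_string()));
                }
                self.datasets.insert(uri.to_string(), rows);
            }
            WriteMode::Append => {
                // Assumption of this sketch: a missing target is simply created.
                self.datasets.entry(uri.to_string()).or_default().extend(rows);
            }
            WriteMode::Overwrite => {
                // Replace whatever was there with the fresh rows.
                self.datasets.insert(uri.to_string(), rows);
            }
        }
        Ok(self.datasets[uri].len())
    }
}
```

In this model a second Create on the same URI returns AlreadyExists, while Append grows the stored rows and Overwrite replaces them. A caller of the real API would handle the corresponding error from the returned Result rather than unwrapping it.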

Usage Examples

Basic Example

use std::sync::Arc;
use arrow_array::{RecordBatch, RecordBatchIterator, Int64Array};
use arrow_schema::{Schema, Field, DataType};
use lance::dataset::Dataset;

async fn create_dataset() -> lance::Result<()> {
    // Single non-nullable Int64 column named "id".
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
    ]));
    let array = Arc::new(Int64Array::from(vec![1, 2, 3, 4, 5]));
    let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
    // Wrap the batch in a RecordBatchReader, the input type Dataset::write expects.
    let reader = RecordBatchIterator::new(
        vec![batch].into_iter().map(Ok),
        schema,
    );

    // Passing None uses the default WriteParams (WriteMode::Create).
    let dataset = Dataset::write(reader, "data/my_dataset", None).await?;
    println!("Created dataset v{}", dataset.version().version);
    Ok(())
}

With Custom Parameters

use lance::dataset::{Dataset, WriteParams, WriteMode};

async fn create_with_params(reader: impl arrow_array::RecordBatchReader + Send + 'static) -> lance::Result<()> {
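    // Start from the defaults and override only the fields of interest.
    let params = WriteParams {
        max_rows_per_file: 500_000,      // cap each data file at 500k rows
        max_rows_per_group: 2048,
        mode: WriteMode::Create,         // errors if the dataset already exists
        enable_v2_manifest_paths: true,
        enable_stable_row_ids: true,
        ..Default::default()
    };

    let dataset = Dataset::write(reader, "s3://bucket/my_dataset", Some(params)).await?;
    println!("Created dataset v{}", dataset.version().version);
    Ok(())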
}

Related Pages

Implements Principle

Principle:Lance_format_Lance_Schema_Definition_And_Dataset_Creation

Requires Environment

Uses Heuristic

Heuristic:Lance_format_Lance_Fragment_Sizing_Strategy
