Implementation:Lance_format_Lance_Dataset_Write
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Columnar_Storage |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Concrete tool, provided by the Lance library, for creating or updating a Lance dataset from a stream of Arrow RecordBatches.
Description
Dataset::write is the primary entry point for creating or writing to a Lance dataset. It accepts an Arrow RecordBatchReader as its data source, a destination (URI string, object_store::path::Path, or Arc<Dataset>), and optional WriteParams. Internally, it constructs an InsertBuilder, applies the provided parameters, and executes the write-and-commit pipeline. The method writes data fragments to storage and atomically commits a manifest, returning the newly created Dataset.
The WriteParams struct controls all physical and logical aspects of the write, including file size limits, write mode, storage configuration, commit handling, row ID stability, manifest path format, and automatic version cleanup.
WriteDestination is an enum that accepts either an existing Arc<Dataset> (for appending or overwriting) or a URI string (for creating a new dataset at a path).
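For illustration, appending through the Arc&lt;Dataset&gt; form might look like the following sketch. This is hedged, not a definitive implementation: it assumes Arc&lt;Dataset&gt; converts into WriteDestination via Into, as the signature below implies, and the function name `append` is hypothetical.

```rust
use std::sync::Arc;
use lance::dataset::{Dataset, WriteMode, WriteParams};

// Sketch: append new batches to an already-open dataset. Passing the
// Arc<Dataset> itself (rather than its URI) reuses the open handle
// instead of re-resolving the path.
async fn append(
    existing: Arc<Dataset>,
    reader: impl arrow_array::RecordBatchReader + Send + 'static,
) -> lance::Result<Dataset> {
    let params = WriteParams {
        mode: WriteMode::Append,
        ..Default::default()
    };
    Dataset::write(reader, existing, Some(params)).await
}
```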
Usage
Use Dataset::write when:
- Creating a new Lance dataset from Arrow data.
- Overwriting an existing dataset with fresh data.
- Programmatically controlling write parameters such as row group size, file size, or storage backend.
Code Reference
Source Location
- Repository: Lance
- File: rust/lance/src/dataset.rs - Lines: L749-L760
Signature
pub async fn write(
batches: impl RecordBatchReader + Send + 'static,
dest: impl Into<WriteDestination<'_>>,
params: Option<WriteParams>,
) -> Result<Self>
Supporting Types
// rust/lance/src/dataset/write.rs:L60-L65
pub enum WriteDestination<'a> {
/// An existing dataset to write to.
Dataset(Arc<Dataset>),
/// A URI to write to.
Uri(&'a str),
}
// rust/lance/src/dataset/write.rs:L108-L116
pub enum WriteMode {
Create,
Append,
Overwrite,
}
// rust/lance/src/dataset/write.rs:L152-L248
pub struct WriteParams {
pub max_rows_per_file: usize, // default: 1,048,576
pub max_rows_per_group: usize, // default: 1,024
pub max_bytes_per_file: usize, // default: 90 GB
pub mode: WriteMode, // default: WriteMode::Create
pub store_params: Option<ObjectStoreParams>,
pub progress: Arc<dyn WriteFragmentProgress>,
pub commit_handler: Option<Arc<dyn CommitHandler>>,
pub data_storage_version: Option<LanceFileVersion>,
pub enable_stable_row_ids: bool,
pub enable_v2_manifest_paths: bool, // default: true
pub session: Option<Arc<Session>>,
pub auto_cleanup: Option<AutoCleanupParams>,
pub skip_auto_cleanup: bool,
pub transaction_properties: Option<Arc<HashMap<String, String>>>,
pub initial_bases: Option<Vec<BasePath>>,
pub target_bases: Option<Vec<u32>>,
pub target_base_names_or_paths: Option<Vec<String>>,
}
Import
use lance::dataset::{Dataset, WriteParams, WriteMode};
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| batches | impl RecordBatchReader + Send + 'static | Yes | Stream of Arrow RecordBatches containing the data to write. The schema is inferred from the reader. |
| dest | impl Into<WriteDestination<'_>> | Yes | Destination path (URI string, object_store Path) or existing dataset (Arc<Dataset>). |
| params | Option<WriteParams> | No | Optional write parameters controlling file layout, write mode, storage backend, commit handler, and more. Defaults to WriteMode::Create with 1M rows/file and 1K rows/group. |
Outputs
| Name | Type | Description |
|---|---|---|
| Result | Result<Dataset> | The newly created or updated Dataset on success, or an error if creation fails (e.g., dataset already exists in Create mode). |
Usage Examples
Basic Example
use std::sync::Arc;
use arrow_array::{RecordBatch, RecordBatchIterator, Int64Array};
use arrow_schema::{Schema, Field, DataType};
use lance::dataset::{Dataset, WriteParams};
async fn create_dataset() -> lance::Result<()> {
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int64, false),
]));
let array = Arc::new(Int64Array::from(vec![1, 2, 3, 4, 5]));
let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
let reader = RecordBatchIterator::new(
vec![batch].into_iter().map(Ok),
schema,
);
let dataset = Dataset::write(reader, "data/my_dataset", None).await?;
println!("Created dataset v{}", dataset.version().version);
Ok(())
}
With Custom Parameters
use lance::dataset::{Dataset, WriteParams, WriteMode};
async fn create_with_params(reader: impl arrow_array::RecordBatchReader + Send + 'static) -> lance::Result<()> {
let params = WriteParams {
max_rows_per_file: 500_000,
max_rows_per_group: 2048,
mode: WriteMode::Create,
enable_v2_manifest_paths: true,
enable_stable_row_ids: true,
..Default::default()
};
let dataset = Dataset::write(reader, "s3://bucket/my_dataset", Some(params)).await?;
Ok(())
}
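The Usage list also mentions overwriting an existing dataset, which the examples above do not show. As a hedged sketch (assuming WriteMode::Overwrite replaces the dataset contents while earlier versions remain readable through Lance's versioning until cleaned up; the path below is a placeholder, not from the source):

```rust
use lance::dataset::{Dataset, WriteMode, WriteParams};

// Sketch: replace the contents at an existing URI with fresh data.
async fn overwrite(
    reader: impl arrow_array::RecordBatchReader + Send + 'static,
) -> lance::Result<()> {
    let params = WriteParams {
        mode: WriteMode::Overwrite,
        ..Default::default()
    };
    let dataset = Dataset::write(reader, "data/my_dataset", Some(params)).await?;
    println!("Now at v{}", dataset.version().version);
    Ok(())
}
```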
Related Pages
Implements Principle
Requires Environment
- Environment:Lance_format_Lance_Rust_Toolchain
- Environment:Lance_format_Lance_Python_Environment
- Environment:Lance_format_Lance_Cloud_Storage_Credentials