Heuristic: Lance Format I/O Buffer and Batch Sizing
| Knowledge Sources | |
|---|---|
| Domains | Performance, Optimization |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Default I/O tuning parameters: a 2 GB I/O buffer (sized for ~256 concurrent cloud reads), 8,192-row batches, and a fragment readahead of 4 for sequential scans.
Description
Lance's scanner uses several tunable parameters to balance memory usage against I/O throughput when reading datasets. These defaults are optimized for cloud storage where latency is high and concurrent requests are beneficial. The I/O buffer size of 2GB is designed to support approximately 256 concurrent reads (typical 8 MiB page size * 256 = 2 GB). All values are configurable via environment variables for production tuning without recompilation.
Usage
Apply this heuristic when tuning scan performance, diagnosing memory issues during reads, or configuring Lance for specific hardware. Reduce buffer sizes on memory-constrained systems. Increase batch size for throughput-oriented workloads. Adjust fragment readahead based on storage latency (higher for cloud, lower for local SSD).
The Insight (Rule of Thumb)
- I/O Buffer Size: 2 GB (configurable via `LANCE_DEFAULT_IO_BUFFER_SIZE`)
  - Rationale: Supports ~256 concurrent reads on cloud storage. Typical page size is 8 MiB; 256 * 8 MiB = 2 GB.
- Batch Size: 8,192 rows per batch (configurable via `LANCE_DEFAULT_BATCH_SIZE`)
  - Rationale: Balances memory usage with processing efficiency for Arrow record batches.
- Fragment Readahead: 4 fragments (configurable via `LANCE_DEFAULT_FRAGMENT_READAHEAD`)
  - Rationale: Pre-fetches upcoming fragments during sequential scans to hide I/O latency.
- Trade-offs: A larger I/O buffer permits more concurrent reads and faster cloud scans, but consumes more memory. A larger batch size improves throughput at the cost of memory per batch. Higher readahead improves sequential performance but adds memory pressure.
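All three knobs share the same env-var-with-fallback pattern. A self-contained sketch of that pattern (`env_or_default` is a hypothetical helper written for illustration, not Lance's actual `parse_env_var`):

```rust
use std::env;

// Hypothetical helper: read an environment variable, falling back to the
// compiled-in default when the variable is unset or fails to parse.
fn env_or_default(name: &str, default: u64) -> u64 {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

const DEFAULT_IO_BUFFER_SIZE: u64 = 2 * 1024 * 1024 * 1024; // 2 GB

fn main() {
    // With no override set, the compiled-in default applies.
    let size = env_or_default("LANCE_DEFAULT_IO_BUFFER_SIZE", DEFAULT_IO_BUFFER_SIZE);
    println!("{size}");
}
```

Note that a malformed value (e.g. `LANCE_DEFAULT_IO_BUFFER_SIZE=abc`) silently falls back to the default rather than erroring, so typos in production configs degrade to default behavior instead of failing the scan.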
Reasoning
Cloud object stores (S3, GCS, Azure) have high per-request latency (50-200ms) but support high concurrency. The 2GB buffer allows Lance to issue many concurrent range requests, effectively hiding latency through parallelism. The 8,192 row batch size is a common default in the Arrow ecosystem that balances columnar processing efficiency with memory overhead. Fragment readahead of 4 provides enough lookahead for sequential scans without excessive memory use.
For local SSD storage where latency is low (< 1ms), these defaults are generous. Users can reduce buffer sizes to save memory without significant performance impact.
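A back-of-the-envelope check of the figures above, using the document's numbers (2 GB buffer, 8 MiB pages, 100 ms as a mid-range cloud latency):

```rust
// How many in-flight pages the buffer can hold.
fn concurrent_reads(buffer_bytes: u64, page_bytes: u64) -> u64 {
    buffer_bytes / page_bytes
}

// Aggregate throughput if every slot turns over once per request latency.
fn throughput_bytes_per_sec(concurrency: u64, page_bytes: u64, latency_s: f64) -> f64 {
    concurrency as f64 * page_bytes as f64 / latency_s
}

fn main() {
    let buffer = 2u64 * 1024 * 1024 * 1024; // 2 GB I/O buffer
    let page = 8u64 * 1024 * 1024; // typical 8 MiB page
    let n = concurrent_reads(buffer, page);
    println!("concurrent reads: {n}"); // 256

    // At 100 ms per request, each slot completes 10 reads per second, so the
    // theoretical ceiling is ~20 GiB/s; real-world numbers will be lower.
    let gib = 1024.0 * 1024.0 * 1024.0;
    let t = throughput_bytes_per_sec(n, page, 0.1) / gib;
    println!("approx throughput ceiling: {t} GiB/s"); // 20
}
```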
Code Evidence
I/O buffer size from `rust/lance/src/dataset/scanner.rs:155-163`:
```rust
const DEFAULT_IO_BUFFER_SIZE_VALUE: u64 = 2 * 1024 * 1024 * 1024; // 2 GB

pub static DEFAULT_IO_BUFFER_SIZE: LazyLock<u64> = LazyLock::new(|| {
    parse_env_var(
        "LANCE_DEFAULT_IO_BUFFER_SIZE",
        &DEFAULT_IO_BUFFER_SIZE_VALUE.to_string(),
    )
    .unwrap_or(DEFAULT_IO_BUFFER_SIZE_VALUE)
});
```
Batch size from `rust/lance/src/dataset/scanner.rs:103,129-131`:
```rust
pub(crate) const BATCH_SIZE_FALLBACK: usize = 8192;

pub fn get_default_batch_size() -> Option<usize> {
    parse_env_var("LANCE_DEFAULT_BATCH_SIZE", &BATCH_SIZE_FALLBACK.to_string())
}
```
Fragment readahead from `rust/lance/src/dataset/scanner.rs:133`:
```rust
pub const LEGACY_DEFAULT_FRAGMENT_READAHEAD: usize = 4;
```