
Heuristic: Lance Fragment Sizing Strategy

From Leeroopedia



Knowledge Sources
Domains: Optimization, Storage
Last Updated: 2026-02-08 19:00 GMT

Overview

A fragment sizing strategy using 1M rows per fragment and a 90GB maximum file size, with a 10% deletion threshold for materializing deletions during compaction.

Description

Lance organizes data into fragments (partitions), each containing one or more data files. Fragment size directly impacts I/O efficiency, scan performance, and compaction overhead. The defaults balance batch I/O throughput (larger fragments mean fewer I/O requests) against memory usage (smaller fragments lower peak memory). The 90GB per-file cap keeps files safely below a 100GB object-size limit assumed for some cloud object stores.

Usage

Apply this heuristic when writing new datasets, configuring compaction, or troubleshooting scan performance. If scans are slow because of many small fragments, increase `target_rows_per_fragment`. If memory is constrained during writes, decrease `max_rows_per_file`. The 10% deletion threshold determines when compaction rewrites a fragment to reclaim space from soft-deleted rows.
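The deletion-threshold check can be sketched as a ratio test: a fragment is rewritten once its deleted-row fraction reaches `materialize_deletions_threshold`. The helper below is illustrative, not Lance's API; only the 0.1 threshold value comes from the quoted defaults.

```rust
// Illustrative sketch of the deletion-materialization check; not Lance's API.
fn should_materialize(deleted_rows: u64, physical_rows: u64, threshold: f64) -> bool {
    if physical_rows == 0 {
        return false;
    }
    (deleted_rows as f64 / physical_rows as f64) >= threshold
}

fn main() {
    let threshold = 0.1; // the 10% default
    // A fragment with 50K deletions out of 1M rows (~5%) is left alone...
    assert!(!should_materialize(50_000, 1_048_576, threshold));
    // ...while one with 150K deletions (~14%) is rewritten during compaction.
    assert!(should_materialize(150_000, 1_048_576, threshold));
}
```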

The Insight (Rule of Thumb)

  • Action: Set `target_rows_per_fragment` and `max_rows_per_file` to 1,048,576 (1M rows).
  • Value: 1M rows per fragment; 1,024 rows per row group; 90GB max bytes per file.
  • Trade-off: Larger fragments improve scan throughput but increase memory during writes and compaction. The 90GB limit sits 10GB below a 100GB object-size cap assumed for some cloud object stores, as a safety margin.
  • Deletion threshold: Set `materialize_deletions_threshold` to 0.1 (10%). Fragments with more than 10% deleted rows will have deletions materialized during compaction.
  • Binary copy batch size: 16MB for page-level binary copies during fragment merging.
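These constants fit together arithmetically; a minimal self-contained check, using only the values quoted in this page:

```rust
fn main() {
    // Defaults quoted in this page, written as the source writes them.
    let target_rows_per_fragment: u64 = 1024 * 1024;       // 1,048,576 rows
    let max_rows_per_group: u64 = 1024;                    // rows per row group
    let max_bytes_per_file: u64 = 90 * 1024 * 1024 * 1024; // 90 GiB

    // A full fragment holds 1024 row groups of 1024 rows each.
    assert_eq!(target_rows_per_fragment / max_rows_per_group, 1024);

    // The 90 GiB cap leaves a 10 GiB margin under a 100 GiB object-size cap.
    let object_store_cap: u64 = 100 * 1024 * 1024 * 1024;
    assert_eq!(object_store_cap - max_bytes_per_file, 10 * 1024 * 1024 * 1024);
}
```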

Reasoning

A fragment size of 1M rows balances minimizing the number of fragments (which reduces manifest overhead and scan-initialization cost) against keeping individual fragment processing within reasonable memory bounds. The 1,024-row row group size matches Arrow's typical batch size for efficient columnar operations. The 10% deletion threshold prevents unnecessary rewrites of lightly deleted fragments while ensuring heavily deleted fragments are cleaned up. The 90GB file cap provides a 10% safety margin below a 100GB per-object limit assumed for some cloud object stores.
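To make the fragment-count trade-off concrete: the number of fragments is the ceiling of the row count divided by the fragment size. The dataset sizes and the `fragment_count` helper below are hypothetical, for illustration only.

```rust
// Illustrative arithmetic only; `fragment_count` is not a Lance API.
fn fragment_count(dataset_rows: u64, rows_per_fragment: u64) -> u64 {
    // Ceiling division.
    (dataset_rows + rows_per_fragment - 1) / rows_per_fragment
}

fn main() {
    let default_size = 1u64 << 20; // 1,048,576, the 1M-row default
    // A hypothetical 250M-row dataset needs 239 fragments at the default size...
    assert_eq!(fragment_count(250_000_000, default_size), 239);
    // ...but 3,815 fragments at 64K rows, inflating manifest and scan-startup cost.
    assert_eq!(fragment_count(250_000_000, 1 << 16), 3815);
}
```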

Code Evidence

CompactionOptions defaults from `rust/lance/src/dataset/optimize.rs:181-198`:

impl Default for CompactionOptions {
    fn default() -> Self {
        Self {
            target_rows_per_fragment: 1024 * 1024,
            max_rows_per_group: 1024,
            materialize_deletions: true,
            materialize_deletions_threshold: 0.1,
            num_threads: None,
            max_bytes_per_file: None,
            batch_size: None,
            defer_index_remap: false,
            enable_binary_copy: false,
            enable_binary_copy_force: false,
            binary_copy_read_batch_bytes: Some(16 * 1024 * 1024),
        }
    }
}
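Given the struct above, individual defaults can be overridden with struct-update syntax. A sketch, not a runnable example: the field names come from the quoted source, but the tuning values are illustrative, and it assumes `CompactionOptions` is in scope from the lance crate.

```rust
// Assumes `CompactionOptions` (quoted above) is imported from the lance crate.
let options = CompactionOptions {
    target_rows_per_fragment: 2 * 1024 * 1024, // 2M rows for scan-heavy workloads
    materialize_deletions_threshold: 0.05,     // reclaim deleted space more eagerly
    ..Default::default()
};
```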

WriteParams defaults from `rust/lance/src/dataset/write.rs:250-257`:

impl Default for WriteParams {
    fn default() -> Self {
        Self {
            max_rows_per_file: 1024 * 1024,
            max_rows_per_group: 1024,
            max_bytes_per_file: 90 * 1024 * 1024 * 1024,
            // ...
        }
    }
}
