Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Lance format Lance Compaction Planning

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Storage_Optimization
Last Updated 2026-02-08 19:00 GMT

Overview

Compaction planning is the process of analyzing a Lance dataset's fragment layout to determine which fragments should be rewritten, merged, or have deletions materialized in order to restore an optimal on-disk structure.

Description

As a Lance table accumulates writes, appends, and deletes, its physical layout degrades. Streaming appends produce many small fragments, while row deletions leave logical holes tracked by separate deletion files. Compaction planning inspects every fragment in the current manifest, classifies each one as a compaction candidate or not, and groups adjacent candidates into tasks that can be executed independently.

The planner iterates fragments in sorted order (by fragment ID) and evaluates two criteria:

  1. High deletion ratio (CompactItself): If materialize_deletions is enabled and the fraction of deleted rows exceeds materialize_deletions_threshold (default 10%), the fragment is marked for self-compaction. This rewrites the fragment without the deleted rows, reclaiming space and eliminating the deletion file overhead.
  1. Too few rows (CompactWithNeighbors): If a fragment has fewer rows than target_rows_per_fragment (default 1,048,576), it is marked as a neighbor-merge candidate. A single small fragment alone is a no-op; it only becomes actionable when adjacent fragments are also candidates, allowing them to be merged together.

Adjacent candidates are grouped into bins. A bin is only valid when its constituent fragments share the same index membership (indexed fragments cannot be merged with unindexed fragments). When a bin accumulates enough rows, it is split at the target_rows_per_fragment boundary so each resulting task produces approximately one full-sized fragment.

Usage

Use compaction planning when:

  • A dataset has accumulated many small fragments from streaming appends and read performance has degraded.
  • A significant fraction of rows have been deleted and you want to reclaim storage space.
  • You need to prepare a distributed compaction workflow where planning happens on a coordinator and execution happens on workers.

Theoretical Basis

Compaction planning follows a greedy bin-packing strategy over a linearly ordered sequence of fragments:

candidates = []
for each fragment in sorted_fragments:
    if deletion_ratio(fragment) > threshold:
        mark CompactItself
    else if row_count(fragment) < target:
        mark CompactWithNeighbors
    else:
        flush current bin

    if candidate and same index set as current bin:
        add to current bin
    else:
        start new bin

for each bin:
    if single fragment and only CompactWithNeighbors: skip (no-op)
    split bin at target_rows_per_fragment boundaries
    emit as CompactionTask

This ensures:

  • Locality preservation: Only adjacent fragments are merged, maintaining insertion order.
  • Index safety: Fragments covered by different index sets are never mixed.
  • Bounded task size: Each task targets approximately target_rows_per_fragment rows, preventing unbounded memory usage during execution.

The time complexity of planning is O(n) in the number of fragments, with an additional async I/O cost to load deletion metadata for each fragment.

Related Pages

Implemented By

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment