Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Apache Paimon Deletion Vectors

From Leeroopedia


Knowledge Sources
Domains Storage, Updates
Last Updated 2026-02-08 00:00 GMT

Overview

Deletion vectors are bitmap-based structures that track deleted rows in immutable data files, enabling efficient updates without rewriting entire files.

Description

The deletion vector principle solves a fundamental challenge in data lake systems: how to support row-level updates and deletes on immutable storage files without incurring the cost of rewriting large data files. Instead of modifying data files in place, the system maintains separate bitmap structures that record which rows have been logically deleted. When reading data, the system consults the deletion vector to filter out deleted rows, presenting a consistent view of the current data state.

This approach provides significant performance benefits for workloads with selective updates. When only a small percentage of rows in a large file need to be deleted or updated, rewriting the entire file would waste substantial I/O and storage resources. Deletion vectors allow the system to record these changes compactly, deferring the physical file rewrites until a background compaction process runs. The bitmap representation is space-efficient, typically using one bit per row, and supports fast lookups to determine whether a specific row position should be included in query results.

The implementation leverages compressed bitmap data structures that provide both space efficiency and fast bitwise operations. Roaring bitmaps, for example, use adaptive encoding that switches between array-based and bitmap-based representations depending on data density. During query execution, readers apply deletion vectors as a filter layer, checking each row's position against the bitmap before returning it to the caller. This lazy deletion approach maintains the benefits of immutable storage while providing update semantics similar to traditional databases.

Usage

Apply deletion vectors when implementing copy-on-write semantics for data lakes, supporting GDPR-compliant deletion requirements, or optimizing update-heavy workloads where only a small fraction of rows change. This pattern is particularly valuable when data files are large and updates are sparse.

Theoretical Basis

The deletion vector pattern operates through the following mechanism:

Bitmap Structure

  • Each data file has an optional associated deletion vector
  • The bitmap maps row positions (0 to N-1) to deletion status
  • Bit value 1 indicates the row is deleted, 0 indicates it exists
  • Compressed bitmap encodings minimize storage overhead

Update Operation

  • When deleting rows, identify their positions in the source file
  • Set corresponding bits in the deletion vector bitmap
  • Write updated deletion vector to storage
  • Update metadata to reference the new deletion vector

Read Operation with Deletion Vector

function readWithDeletionVector(dataFile, deletionVector):
    reader = openDataFile(dataFile)
    rowPosition = 0

    while reader.hasNext():
        row = reader.readNext()

        if not deletionVector.isDeleted(rowPosition):
            yield row

        rowPosition++

Compaction Strategy

  • Monitor deletion vector density for each file
  • When deleted row percentage exceeds threshold (e.g., 30%):
 * Read surviving rows from original file
 * Write compacted file without deleted rows
 * Discard old file and deletion vector
  • This reclaims storage and improves read performance

Bitmap Operations

interface DeletionVector:
    function isDeleted(position) -> boolean
    function markDeleted(position)
    function merge(otherVector) -> DeletionVector
    function cardinality() -> integer  // count of deleted rows
    function serialize() -> bytes

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment