
Principle:Risingwavelabs Risingwave Iceberg Table Maintenance

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Storage_Optimization, Iceberg
Last Updated 2026-02-09 07:00 GMT

Overview

A storage optimization practice that compacts small files, cleans up orphaned data, and expires old snapshots in Iceberg tables to maintain query performance and storage efficiency.

Description

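Iceberg Table Maintenance addresses a fundamental challenge of streaming-to-lakehouse pipelines: streaming writers commit at every checkpoint and typically produce one or more new data files each time, so small files accumulate quickly. Every file adds manifest entries and a separate open/read request at query time, which degrades read performance and inflates metadata and storage overhead.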
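To make the scale of the problem concrete, the following is a rough back-of-the-envelope sketch in Python. The function names, the `writer_parallelism` parameter, and the sample numbers are illustrative assumptions, not values taken from RisingWave or Iceberg; the estimate assumes every writer emits exactly one file per checkpoint.

```python
import math


def files_per_day(checkpoint_interval_s: float, writer_parallelism: int = 1) -> int:
    """Estimate data files committed per day.

    Assumption (hypothetical model): each parallel writer emits exactly
    one file at every checkpoint.
    """
    checkpoints_per_day = math.floor(86_400 / checkpoint_interval_s)
    return checkpoints_per_day * writer_parallelism


def average_file_size_mb(daily_volume_gb: float, n_files: int) -> float:
    """Average file size in MB if the daily volume is spread evenly over n_files."""
    return daily_volume_gb * 1024 / n_files


# Illustrative numbers only: 60 s checkpoints, 4 parallel writers, 50 GB/day.
n_files = files_per_day(checkpoint_interval_s=60, writer_parallelism=4)
avg_mb = average_file_size_mb(daily_volume_gb=50, n_files=n_files)
print(n_files, round(avg_mb, 1))
```

Under these assumed inputs the sink would commit several thousand files per day, each far below the 128-512 MB range that compaction usually targets.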

Maintenance operations include:

  • Rewrite Data Files: Compacts many small Parquet files into fewer, larger files
  • Rewrite Manifests: Optimizes manifest files for faster metadata access
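  • Expire Snapshots: Removes old snapshots that are no longer needed, along with data files that only those snapshots reference
  • Remove Orphan Files: Cleans up files in the table location that no table metadata references (for example, files left behind by failed writes)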

These operations are typically run periodically via schedulers like Apache Airflow or as ad-hoc Spark jobs.
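The four operations listed above correspond to stored procedures that Iceberg exposes through Spark SQL. The sketch below only builds the `CALL` statements as strings so that it runs without a Spark cluster. The helper name `maintenance_statements`, the catalog and table names, and the retention defaults are hypothetical choices for illustration, not recommendations from the source.

```python
from datetime import datetime, timedelta, timezone


def maintenance_statements(
    catalog: str,
    table: str,
    now: datetime,
    snapshot_retention: timedelta = timedelta(days=7),   # assumed default
    orphan_grace: timedelta = timedelta(days=3),         # assumed default
    target_file_size_bytes: int = 512 * 1024 * 1024,     # 512 MB
    retain_last: int = 5,                                # assumed default
) -> dict:
    """Build Spark SQL CALL statements for the four maintenance operations."""

    def ts(dt: datetime) -> str:
        # Spark reads a TIMESTAMP literal in the session time zone.
        return dt.strftime("%Y-%m-%d %H:%M:%S.000")

    proc = f"{catalog}.system"
    return {
        "rewrite_data_files": (
            f"CALL {proc}.rewrite_data_files(table => '{table}', "
            f"options => map('target-file-size-bytes', '{target_file_size_bytes}'))"
        ),
        "rewrite_manifests": f"CALL {proc}.rewrite_manifests(table => '{table}')",
        "expire_snapshots": (
            f"CALL {proc}.expire_snapshots(table => '{table}', "
            f"older_than => TIMESTAMP '{ts(now - snapshot_retention)}', "
            f"retain_last => {retain_last})"
        ),
        "remove_orphan_files": (
            f"CALL {proc}.remove_orphan_files(table => '{table}', "
            f"older_than => TIMESTAMP '{ts(now - orphan_grace)}')"
        ),
    }


# Hypothetical catalog and table names.
stmts = maintenance_statements(
    "my_catalog", "db.events", now=datetime(2026, 2, 9, 7, 0, tzinfo=timezone.utc)
)
for name, sql in stmts.items():
    print(f"{name}: {sql}")
```

Each generated string could then be passed to `spark.sql(...)` on a Spark session that has the Iceberg catalog and SQL extensions configured. Keeping a grace period on `remove_orphan_files` matters because files written by an in-flight commit are not yet referenced by table metadata.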

Usage

Use Iceberg table maintenance when:

  • Streaming sinks produce many small files over time
  • Query performance on Iceberg tables degrades
  • Storage costs increase due to accumulated snapshots and orphan files
  • A production pipeline needs a recurring maintenance schedule

Theoretical Basis

Small File Problem:
    Streaming writes every N seconds → many small Parquet files
    Query must open each file → high I/O overhead

Compaction Solution:
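    CALL <catalog>.system.rewrite_data_files(table => '<db>.<table>')
    - Reads small files
    - Merges into larger files (target size ~128-512 MB)
    - Atomically commits a new snapshot that replaces the old files with the new ones
      (the old files stay on storage until the snapshots referencing them expire)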

Maintenance Schedule (recommended):
    Hourly:  rewrite_data_files, rewrite_manifests
    Daily:   expire_snapshots, remove_orphan_files
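The hourly/daily split above can be sketched as a small Python function that a scheduler invokes once per hour. The name `operations_due` and the `daily_hour` parameter are hypothetical; a real deployment would more likely express this as two separate schedules in the orchestrator (for example, two Airflow DAGs).

```python
from datetime import datetime

# Operation groups taken from the schedule above.
HOURLY = ("rewrite_data_files", "rewrite_manifests")
DAILY = ("expire_snapshots", "remove_orphan_files")


def operations_due(run_time: datetime, daily_hour: int = 3) -> list:
    """Return the maintenance operations to run for an hourly trigger.

    Assumption: the caller fires exactly once per hour. The daily
    operations are added only on the run whose hour equals daily_hour.
    """
    ops = list(HOURLY)
    if run_time.hour == daily_hour:
        ops.extend(DAILY)
    return ops


print(operations_due(datetime(2026, 2, 9, 14, 0)))
print(operations_due(datetime(2026, 2, 9, 3, 0)))
```

Running compaction before snapshot expiry in the daily slot is deliberate in this sketch: files replaced by compaction only become deletable once the snapshots that still reference them have expired.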
