Principle:Risingwavelabs Risingwave Iceberg Table Maintenance
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Storage_Optimization, Iceberg |
| Last Updated | 2026-02-09 07:00 GMT |
Overview
A storage optimization practice that compacts small files, cleans up orphaned data, and expires old snapshots in Iceberg tables to maintain query performance and storage efficiency.
Description
Iceberg Table Maintenance addresses a fundamental challenge of streaming-to-lakehouse pipelines: streaming writers commit on every checkpoint, so each writer produces at least one small file per checkpoint. Over time the growing file count degrades read performance and inflates metadata and storage overhead.
Maintenance operations include:
- Rewrite Data Files: Compacts many small Parquet files into fewer, larger files
- Rewrite Manifests: Optimizes manifest files for faster metadata access
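- Expire Snapshots: Removes snapshots older than the retention window, along with data files that only those snapshots reference
- Remove Orphan Files: Deletes files under the table location that are not referenced by any table metadata, for example leftovers from failed writes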
These operations are typically run periodically via schedulers like Apache Airflow or as ad-hoc Spark jobs.
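As a minimal sketch, the four operations above map onto Iceberg's Spark SQL stored procedures. The helper below only builds the `CALL` statements as strings, so that a scheduler task or an ad-hoc Spark session could submit them. It assumes a Spark environment with the Iceberg SQL extensions enabled. The function name, the catalog name `lake`, the table name `db.events`, and the retention windows are illustrative assumptions, not values from this page.

```python
from datetime import datetime, timedelta


def maintenance_statements(
    catalog: str,
    table: str,
    now: datetime,
    snapshot_retention: timedelta = timedelta(days=7),   # assumed retention window
    orphan_grace: timedelta = timedelta(days=3),         # assumed grace period
    target_file_size_bytes: int = 512 * 1024 * 1024,     # 512 MB compaction target
) -> dict[str, str]:
    """Build one Spark SQL CALL statement per maintenance operation."""

    def ts(moment: datetime) -> str:
        # Spark SQL timestamp literal, e.g. TIMESTAMP '2026-02-02 07:00:00'
        return moment.strftime("TIMESTAMP '%Y-%m-%d %H:%M:%S'")

    prefix = f"CALL {catalog}.system"
    return {
        "rewrite_data_files": (
            f"{prefix}.rewrite_data_files(table => '{table}', "
            f"options => map('target-file-size-bytes', '{target_file_size_bytes}'))"
        ),
        "rewrite_manifests": f"{prefix}.rewrite_manifests(table => '{table}')",
        "expire_snapshots": (
            f"{prefix}.expire_snapshots(table => '{table}', "
            f"older_than => {ts(now - snapshot_retention)})"
        ),
        "remove_orphan_files": (
            f"{prefix}.remove_orphan_files(table => '{table}', "
            f"older_than => {ts(now - orphan_grace)})"
        ),
    }


if __name__ == "__main__":
    for name, sql in maintenance_statements(
        "lake", "db.events", datetime(2026, 2, 9, 7, 0, 0)
    ).items():
        print(f"{name}: {sql}")
```

Each statement could then be passed to `spark.sql(...)`. Keeping the orphan-file cutoff well in the past avoids deleting files that an in-flight write has produced but not yet committed.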
Usage
Use Iceberg table maintenance when:
- Streaming sinks produce many small files over time
- Query performance on Iceberg tables degrades
- Storage costs increase due to accumulated snapshots and orphan files
- A production deployment needs a recurring maintenance schedule
Theoretical Basis
Small File Problem:
Streaming writes every N seconds → many small Parquet files
Query must open each file → high I/O overhead
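To make the scale of the problem concrete, here is a rough back-of-the-envelope calculation. The checkpoint interval, writer count, and daily data volume are hypothetical inputs chosen for illustration, and the model assumes every writer emits exactly one file per checkpoint.

```python
SECONDS_PER_DAY = 86_400


def files_per_day(checkpoint_interval_s: int, writers: int = 1) -> int:
    """Files written per day if each writer emits one file per checkpoint."""
    checkpoints_per_day = SECONDS_PER_DAY // checkpoint_interval_s
    return checkpoints_per_day * writers


def avg_file_size_mib(daily_bytes: int, file_count: int) -> float:
    """Average file size in MiB for a given daily volume spread over file_count files."""
    return daily_bytes / file_count / 1024**2


if __name__ == "__main__":
    count = files_per_day(checkpoint_interval_s=10, writers=4)
    size = avg_file_size_mib(daily_bytes=50 * 1024**3, file_count=count)
    print(f"{count} files/day, about {size:.2f} MiB each")
```

Under these assumptions, a 10-second checkpoint interval with 4 writers yields 34,560 files per day, and 50 GiB of daily data averages roughly 1.5 MiB per file, far below the target sizes discussed in this section.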
Compaction Solution:
CALL <catalog>.system.rewrite_data_files(table => '<db>.<table>')
- Reads small files
- Merges into larger files (target size ~128-512 MB)
- Atomically replaces old files with new ones
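The merge step can be pictured as a bin-packing problem: group small files so that each group adds up to roughly the target size, then rewrite each group as one file. The following is a simplified first-fit-decreasing sketch of that idea, not Iceberg's actual planner. The function name and the example sizes are hypothetical.

```python
def plan_compaction(file_sizes: list[int], target_size: int) -> list[list[int]]:
    """Group files smaller than target_size into bins of at most target_size bytes."""
    # Files already at or above the target are left alone.
    small = sorted((s for s in file_sizes if s < target_size), reverse=True)

    bins: list[list[int]] = []
    for size in small:
        # First fit: place the file in the first bin that still has room.
        for group in bins:
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            bins.append([size])

    # Rewriting a group that holds a single file gains nothing, so skip it.
    return [group for group in bins if len(group) > 1]


if __name__ == "__main__":
    MB = 1024 * 1024
    sizes = [8 * MB] * 100 + [600 * MB]
    for i, group in enumerate(plan_compaction(sizes, target_size=512 * MB)):
        print(f"group {i}: {len(group)} files -> {sum(group) // MB} MB")
```

In this example, 100 files of 8 MB collapse into two output files (64 and 36 inputs respectively), while the 600 MB file is untouched. The atomic swap itself happens at the table level: the rewrite commits a new snapshot that references the merged files in place of the originals.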
Maintenance Schedule (recommended):
Hourly: rewrite_data_files, rewrite_manifests
Daily: expire_snapshots, remove_orphan_files
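One way to encode this schedule is a small dispatcher that a scheduler invokes at the top of every hour. This is a sketch under stated assumptions: the function name and the `daily_hour` parameter are hypothetical, and a real deployment would typically express the same cadence as cron expressions in the scheduler itself.

```python
from datetime import datetime

HOURLY = ("rewrite_data_files", "rewrite_manifests")
DAILY = ("expire_snapshots", "remove_orphan_files")


def procedures_due(tick: datetime, daily_hour: int = 3) -> list[str]:
    """Return the procedures to run for a tick fired at the top of each hour."""
    due = list(HOURLY)
    # Daily cleanup runs once, in the chosen hour, after that hour's compaction.
    if tick.hour == daily_hour:
        due.extend(DAILY)
    return due


if __name__ == "__main__":
    print(procedures_due(datetime(2026, 2, 9, 14)))
    print(procedures_due(datetime(2026, 2, 9, 3)))
```

Running compaction before snapshot expiry in the same tick means the files that compaction replaced become eligible for removal once the snapshots that still reference them age out.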