Principle: Apache Flink File Compaction Strategy
| Knowledge Sources | |
|---|---|
| Domains | Stream_Processing, File_IO |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A configurable strategy that determines when and how to merge small output files into larger ones, addressing the small-file problem in streaming file sinks.
Description
File Compaction Strategy solves the small-file problem that arises when streaming sinks produce many small part files (due to frequent checkpoints or rolling policies). Small files degrade filesystem performance (metadata overhead), downstream query performance (many small reads), and storage efficiency (block fragmentation).
The strategy defines:
- When to compact: based on checkpoint count or accumulated file-size thresholds
- How to compact: two approaches:
  - Byte-level concatenation: fast; simply appends file contents (suitable for row formats)
  - Record-level rewriting: reads records and rewrites them (required for columnar formats that need a proper header/footer structure)
- Parallelism: a configurable thread pool for asynchronous compaction execution
Usage
Enable compaction when the sink produces many small files due to frequent checkpoints or short rolling intervals. Choose ConcatFileCompactor for text/row formats (fast byte-level concatenation) and RecordWiseFileCompactor for columnar formats (rewrites records so the output file has a valid structure).
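For the FileSink shipped with Flink 1.15+, compaction is enabled on the sink builder. A minimal sketch, assuming a row-format sink; the output path, trigger values, and thread count are illustrative:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.connector.file.sink.compactor.ConcatFileCompactor;
import org.apache.flink.connector.file.sink.compactor.FileCompactStrategy;
import org.apache.flink.core.fs.Path;

public class CompactingSinkExample {
    public static FileSink<String> buildSink(String outputDir) {
        FileCompactStrategy strategy = FileCompactStrategy.Builder.newBuilder()
                // trigger compaction for a bucket after 5 completed checkpoints
                .enableCompactionOnCheckpoint(5)
                // run compaction asynchronously on 4 threads
                .setNumCompactThreads(4)
                .build();

        return FileSink
                .forRowFormat(new Path(outputDir), new SimpleStringEncoder<String>("UTF-8"))
                // byte-level concatenation is safe for this row format
                .enableCompact(strategy, new ConcatFileCompactor())
                .build();
    }
}
```

The strategy builder also exposes a size-based trigger (`setSizeThreshold`), matching the second trigger condition described above.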
Theoretical Basis
```
// Abstract compaction trigger
function shouldCompact(bucket, checkpoint):
    if checkpointCount >= numCheckpointsBeforeCompaction:
        return COMPACT
    if accumulatedSize >= sizeThreshold:
        return COMPACT
    return WAIT
```
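The trigger above can be sketched as a small stateful class in Java; the class and method names here are illustrative, not Flink's internal API:

```java
public class CompactTrigger {
    private final int numCheckpointsBeforeCompaction;
    private final long sizeThreshold;
    private int checkpointCount = 0;
    private long accumulatedSize = 0;

    public CompactTrigger(int numCheckpointsBeforeCompaction, long sizeThreshold) {
        this.numCheckpointsBeforeCompaction = numCheckpointsBeforeCompaction;
        this.sizeThreshold = sizeThreshold;
    }

    // Record a finished part file of the given size for this bucket.
    public void onFileAdded(long fileSize) {
        accumulatedSize += fileSize;
    }

    // Called once per completed checkpoint; returns true when the
    // bucket's pending files should be handed to the compactor,
    // resetting the counters for the next compaction round.
    public boolean onCheckpoint() {
        checkpointCount++;
        if (checkpointCount >= numCheckpointsBeforeCompaction
                || accumulatedSize >= sizeThreshold) {
            checkpointCount = 0;
            accumulatedSize = 0;
            return true;
        }
        return false;
    }
}
```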
```
// Compaction strategies
function concatCompact(files):
    output = openOutputStream()
    for each file in files:
        copyBytes(file, output)
        if delimiter: output.write(delimiter)
    return output
```
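The byte-level approach can be sketched with plain `java.nio`; this is a simplified stand-in for what ConcatFileCompactor does, writing the delimiter between files rather than after each one:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ConcatCompact {
    // Appends each part file's raw bytes to `target`, inserting an
    // optional delimiter (e.g. "\n") between consecutive files.
    public static void concat(List<Path> parts, Path target, byte[] delimiter)
            throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            boolean first = true;
            for (Path part : parts) {
                if (!first && delimiter != null) {
                    out.write(delimiter);
                }
                Files.copy(part, out);
                first = false;
            }
        }
    }
}
```

Because no record is ever parsed, this is fast but only correct when the output format is a plain byte sequence, which is why it suits row formats.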
```
function recordWiseCompact(files, readerFactory, writer):
    for each file in files:
        reader = readerFactory.createFor(file)
        while record = reader.read():
            writer.write(record)
```
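A minimal record-wise rewrite in the same spirit, using lines as stand-in records; Flink's RecordWiseFileCompactor instead works through pluggable reader/writer factories:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class RecordWiseCompact {
    // Mirrors the pseudocode: open a reader per input file, pull records
    // one by one, and push them all through a single shared writer so the
    // output file gets one consistent structure end to end.
    public static void compact(List<Path> parts, Path target) throws IOException {
        try (Writer writer = Files.newBufferedWriter(target)) {
            for (Path part : parts) {
                try (BufferedReader reader = Files.newBufferedReader(part)) {
                    String record;
                    while ((record = reader.readLine()) != null) {
                        writer.write(record);
                        writer.write('\n');
                    }
                }
            }
        }
    }
}
```

For a columnar format the writer would emit a fresh header and footer for the merged file, which is exactly what byte concatenation cannot do.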