Principle: Apache Flink File Compaction Strategy
| Knowledge Sources | |
|---|---|
| Domains | Stream_Processing, File_IO |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A configurable strategy that determines when and how to merge small output files into larger ones, addressing the small-file problem in streaming file sinks.
Description
File Compaction Strategy solves the small-file problem that arises when streaming sinks produce many small part files (due to frequent checkpoints or rolling policies). Small files degrade filesystem performance (metadata overhead), downstream query performance (many small reads), and storage efficiency (block fragmentation).
The strategy defines:
- When to compact: based on checkpoint count or accumulated file-size thresholds
- How to compact: two approaches:
  - Byte-level concatenation: fast; simply appends file contents (suitable for row formats)
  - Record-level rewriting: reads records and rewrites them (required for columnar formats that need a proper header/footer structure)
- Parallelism: a configurable thread pool for asynchronous compaction execution
Usage
Enable compaction when the sink produces many small files due to frequent checkpoints or short rolling intervals. Choose ConcatFileCompactor for text/row formats (fast byte-level concatenation) and RecordWiseFileCompactor for columnar formats (rewrites records so the output file has a valid structure).
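For the FileSink shipped with Flink 1.15+, compaction is enabled on the sink builder. A minimal sketch, assuming a row-format sink; the output path, trigger values, and thread count are illustrative:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.connector.file.sink.compactor.ConcatFileCompactor;
import org.apache.flink.connector.file.sink.compactor.FileCompactStrategy;
import org.apache.flink.core.fs.Path;

public class CompactingSinkExample {
    public static FileSink<String> buildSink(String outputDir) {
        FileCompactStrategy strategy = FileCompactStrategy.Builder.newBuilder()
                // trigger compaction for a bucket after 5 completed checkpoints
                .enableCompactionOnCheckpoint(5)
                // run compaction asynchronously on 4 threads
                .setNumCompactThreads(4)
                .build();

        return FileSink
                .forRowFormat(new Path(outputDir), new SimpleStringEncoder<String>("UTF-8"))
                // byte-level concatenation is safe for this row format
                .enableCompact(strategy, new ConcatFileCompactor())
                .build();
    }
}
```

The strategy builder also exposes a size-based trigger (`setSizeThreshold`), matching the second trigger condition described above.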
Theoretical Basis
```
// Abstract compaction trigger
function shouldCompact(bucket, checkpoint):
    if checkpointCount >= numCheckpointsBeforeCompaction:
        return COMPACT
    if accumulatedSize >= sizeThreshold:
        return COMPACT
    return WAIT
```
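The trigger above can be sketched as a small stateful class in Java; the class and method names here are illustrative, not Flink's internal API:

```java
public class CompactTrigger {
    private final int numCheckpointsBeforeCompaction;
    private final long sizeThreshold;
    private int checkpointCount = 0;
    private long accumulatedSize = 0;

    public CompactTrigger(int numCheckpointsBeforeCompaction, long sizeThreshold) {
        this.numCheckpointsBeforeCompaction = numCheckpointsBeforeCompaction;
        this.sizeThreshold = sizeThreshold;
    }

    // Record a finished part file of the given size for this bucket.
    public void onFileAdded(long fileSize) {
        accumulatedSize += fileSize;
    }

    // Called once per completed checkpoint; returns true when the
    // bucket's pending files should be handed to the compactor,
    // resetting the counters for the next compaction round.
    public boolean onCheckpoint() {
        checkpointCount++;
        if (checkpointCount >= numCheckpointsBeforeCompaction
                || accumulatedSize >= sizeThreshold) {
            checkpointCount = 0;
            accumulatedSize = 0;
            return true;
        }
        return false;
    }
}
```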
```
// Compaction strategies
function concatCompact(files):
    output = openOutputStream()
    for each file in files:
        copyBytes(file, output)
        if delimiter: output.write(delimiter)
    return output
```
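The byte-level approach can be sketched with plain `java.nio`; this is a simplified stand-in for what ConcatFileCompactor does, writing the delimiter between files rather than after each one:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ConcatCompact {
    // Appends each part file's raw bytes to `target`, inserting an
    // optional delimiter (e.g. "\n") between consecutive files.
    public static void concat(List<Path> parts, Path target, byte[] delimiter)
            throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            boolean first = true;
            for (Path part : parts) {
                if (!first && delimiter != null) {
                    out.write(delimiter);
                }
                Files.copy(part, out);
                first = false;
            }
        }
    }
}
```

Because no record is ever parsed, this is fast but only correct when the output format is a plain byte sequence, which is why it suits row formats.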
```
function recordWiseCompact(files, readerFactory, writer):
    for each file in files:
        reader = readerFactory.createFor(file)
        while record = reader.read():
            writer.write(record)
```
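A minimal record-wise rewrite in the same spirit, using lines as stand-in records; Flink's RecordWiseFileCompactor instead works through pluggable reader/writer factories:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class RecordWiseCompact {
    // Mirrors the pseudocode: open a reader per input file, pull records
    // one by one, and push them all through a single shared writer so the
    // output file gets one consistent structure end to end.
    public static void compact(List<Path> parts, Path target) throws IOException {
        try (Writer writer = Files.newBufferedWriter(target)) {
            for (Path part : parts) {
                try (BufferedReader reader = Files.newBufferedReader(part)) {
                    String record;
                    while ((record = reader.readLine()) != null) {
                        writer.write(record);
                        writer.write('\n');
                    }
                }
            }
        }
    }
}
```

For a columnar format the writer would emit a fresh header and footer for the merged file, which is exactly what byte concatenation cannot do.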