
Principle:Apache Flink File Compaction Strategy

From Leeroopedia


Knowledge Sources
Domains Stream_Processing, File_IO
Last Updated 2026-02-09 00:00 GMT

Overview

A configurable strategy that determines when and how to merge small output files into larger ones, addressing the small-file problem in streaming file sinks.

Description

File Compaction Strategy solves the small-file problem that arises when streaming sinks produce many small part files (due to frequent checkpoints or rolling policies). Small files degrade filesystem performance (metadata overhead), downstream query performance (many small reads), and storage efficiency (block fragmentation).

The strategy defines:

  • When to compact: Based on checkpoint count or accumulated file size thresholds
  • How to compact: Two approaches:
    • Byte-level concatenation: Fast, simply appends file contents (suitable for row formats)
    • Record-level rewriting: Reads records and rewrites them (required for columnar formats that need proper footer/header structure)
  • Parallelism: Configurable thread pool for async compaction execution
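The two triggers above can be sketched as a small decision class. This is a hypothetical illustration, not Flink code: the field names (`numCheckpointsBeforeCompaction`, `sizeThreshold`) mirror the knobs the strategy exposes, but the class itself is invented for this page.

```java
import java.util.List;

// Hypothetical sketch of the two compaction triggers described above.
// Either condition alone is enough to request a compaction.
class CompactionTrigger {
    final int numCheckpointsBeforeCompaction; // fire every N checkpoints
    final long sizeThreshold;                 // or once pending bytes exceed this

    CompactionTrigger(int numCheckpointsBeforeCompaction, long sizeThreshold) {
        this.numCheckpointsBeforeCompaction = numCheckpointsBeforeCompaction;
        this.sizeThreshold = sizeThreshold;
    }

    boolean shouldCompact(int checkpointsSinceLastCompaction, List<Long> pendingFileSizes) {
        if (checkpointsSinceLastCompaction >= numCheckpointsBeforeCompaction) {
            return true;
        }
        long accumulated = pendingFileSizes.stream().mapToLong(Long::longValue).sum();
        return accumulated >= sizeThreshold;
    }
}
```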

Usage

Enable compaction when the sink produces many small files due to frequent checkpoints or short rolling intervals. Choose ConcatFileCompactor for text and other row formats (fast byte-level append) and RecordWiseFileCompactor for columnar formats such as Parquet, whose records must be read back and rewritten so the merged file keeps a valid header/footer structure.
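A minimal configuration sketch, assuming Flink's FileSink from flink-connector-files (1.15 or later), where compaction is enabled on the sink builder. The path and threshold values are placeholders:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.connector.file.sink.compactor.ConcatFileCompactor;
import org.apache.flink.connector.file.sink.compactor.FileCompactStrategy;
import org.apache.flink.core.fs.Path;

// Row-format sink with compaction: merge pending files roughly every
// 5 checkpoints, or sooner once ~16 MiB of small files have accumulated.
FileSink<String> sink = FileSink
    .forRowFormat(new Path("/tmp/output"), new SimpleStringEncoder<String>("UTF-8"))
    .enableCompact(
        FileCompactStrategy.Builder.newBuilder()
            .enableCompactionOnCheckpoint(5)     // checkpoint-count trigger
            .setSizeThreshold(16 * 1024 * 1024)  // accumulated-size trigger
            .setNumCompactThreads(2)             // async compaction parallelism
            .build(),
        new ConcatFileCompactor())               // byte-level concatenation
    .build();
```

For a columnar format, the ConcatFileCompactor would be replaced by a RecordWiseFileCompactor with a reader factory matching the format.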

Theoretical Basis

// Abstract compaction trigger: either threshold alone requests compaction
function shouldCompact(bucket, checkpoint):
    if bucket.checkpointCount >= numCheckpointsBeforeCompaction:
        return COMPACT
    if bucket.accumulatedSize >= sizeThreshold:
        return COMPACT
    return WAIT

// Compaction strategies
function concatCompact(files):
    output = openOutputStream()
    for each file in files:
        copyBytes(file, output)
        if delimiter is set: output.write(delimiter)
    output.close()
    return output

function recordWiseCompact(files, readerFactory, writer):
    for each file in files:
        reader = readerFactory.createFor(file)
        while (record = reader.read()) is not null:
            writer.write(record)
        reader.close()
    writer.close()
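The pseudocode above can be made concrete as an in-memory sketch. This is a hypothetical illustration (real compactors stream from the filesystem, and a record-wise writer would re-encode into the target format, e.g. Parquet); `CompactorSketch` and its method names are invented here:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class CompactorSketch {

    // Byte-level concatenation: append file contents verbatim, optionally
    // separated by a delimiter (e.g. '\n' for line-oriented row formats).
    static byte[] concatCompact(List<byte[]> files, byte[] delimiter) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] file : files) {
            out.write(file);
            if (delimiter != null) {
                out.write(delimiter);
            }
        }
        return out.toByteArray();
    }

    // Record-level rewriting: pass every record through a single writer, so
    // the output carries one valid header/footer structure instead of many.
    static List<String> recordWiseCompact(List<List<String>> files) {
        List<String> writer = new ArrayList<>();
        for (List<String> file : files) {
            for (String record : file) {
                writer.add(record); // a real writer would re-encode the record
            }
        }
        return writer;
    }
}
```

The trade-off is visible even in the sketch: concatenation never inspects the bytes, so it is fast but only safe for formats where files can be appended; record-wise rewriting pays the decode/encode cost to produce a structurally valid merged file.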

Related Pages

Implemented By
