Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Apache Flink Format Factory SPI

From Leeroopedia


Knowledge Sources
Domains Format_Factory, SPI
Last Updated 2026-02-09 00:00 GMT

Overview

Description

The Format Factory SPI (Service Provider Interface) defines Flink's pluggable extension mechanism for discovering and instantiating bulk file format readers and writers within the file system connector. Through a pair of marker interfaces -- BulkReaderFormatFactory and BulkWriterFormatFactory -- format providers (such as Parquet, ORC, or Avro) register themselves via Java's ServiceLoader mechanism. Flink's FactoryUtil then discovers these factories at runtime, matches them against user-specified format identifiers, and delegates format construction to the appropriate factory.

The SPI layer also includes the BulkDecodingFormat interface, which extends the Table API's DecodingFormat with file-source-specific capabilities such as best-effort filter pushdown.

Theoretical Basis

Service Provider Interface Pattern

The SPI pattern decouples format definition from format discovery. Format authors implement a factory interface and declare it in a META-INF/services file. At table creation time, Flink's FactoryUtil.createTableFactoryHelper() scans the classpath via ServiceLoader, finds all registered factories, and selects the one whose identifier matches the user's DDL configuration. This enables:

  • Zero-code registration -- adding a new format requires only a JAR on the classpath; no Flink core changes are needed.
  • Isolation -- each format lives in its own module with its own dependencies.
  • Late binding -- format selection happens at runtime based on configuration, not at compile time.

Reading: BulkReaderFormatFactory

BulkReaderFormatFactory extends the Table API's DecodingFormatFactory parameterized with BulkFormat<RowData, FileSourceSplit>. Its sole contract is to produce a BulkDecodingFormat<RowData> from a table factory context and format-specific options:

public interface BulkReaderFormatFactory
        extends DecodingFormatFactory<BulkFormat<RowData, FileSourceSplit>> {

    @Override
    BulkDecodingFormat<RowData> createDecodingFormat(
            DynamicTableFactory.Context context, ReadableConfig formatOptions);
}

The returned BulkDecodingFormat acts as a bridge to the lower-level BulkFormat reader, adding optional filter pushdown support:

public interface BulkDecodingFormat<T>
        extends DecodingFormat<BulkFormat<T, FileSourceSplit>> {

    default void applyFilters(List<ResolvedExpression> filters) {}
}

This allows format implementations to accept predicate expressions from the query optimizer and apply them during file scanning for improved I/O efficiency (for example, Parquet column and row group pruning).

Writing: BulkWriterFormatFactory

BulkWriterFormatFactory mirrors the reader side by extending EncodingFormatFactory parameterized with BulkWriter.Factory<RowData>. The interface is intentionally minimal -- it inherits all behavior from its parent and adds no additional methods:

public interface BulkWriterFormatFactory
        extends EncodingFormatFactory<BulkWriter.Factory<RowData>> {
    // fully specified by generics
}

This symmetry between reader and writer factories ensures that format authors provide both halves of the format lifecycle through the same discovery mechanism.

Factory Discovery Flow

The end-to-end discovery sequence is:

Step Actor Action
1 User Declares FORMAT = 'parquet' in DDL.
2 FactoryUtil Calls ServiceLoader.load(BulkReaderFormatFactory.class) to find all registered reader factories.
3 FactoryUtil Matches factory identifier against user-specified format name.
4 Factory createDecodingFormat() returns a BulkDecodingFormat bound to format-specific options.
5 Table Source Calls createRuntimeDecoder() on the format to obtain a BulkFormat reader.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment