Principle:Apache Flink Format Factory SPI
| Knowledge Sources | |
|---|---|
| Domains | Format_Factory, SPI |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Description
The Format Factory SPI (Service Provider Interface) defines Flink's pluggable extension mechanism for discovering and instantiating bulk file format readers and writers within the file system connector. Through a pair of marker interfaces -- BulkReaderFormatFactory and BulkWriterFormatFactory -- format providers (such as Parquet, ORC, or Avro) register themselves via Java's ServiceLoader mechanism. Flink's FactoryUtil then discovers these factories at runtime, matches them against user-specified format identifiers, and delegates format construction to the appropriate factory.
The SPI layer also includes the BulkDecodingFormat interface, which extends the Table API's DecodingFormat with file-source-specific capabilities such as best-effort filter pushdown.
Theoretical Basis
Service Provider Interface Pattern
The SPI pattern decouples format definition from format discovery. Format authors implement a factory interface and declare it in a META-INF/services file. At table creation time, Flink's FactoryUtil.createTableFactoryHelper() scans the classpath via ServiceLoader, finds all registered factories, and selects the one whose identifier matches the user's DDL configuration. This enables:
- Zero-code registration -- adding a new format requires only a JAR on the classpath; no Flink core changes are needed.
- Isolation -- each format lives in its own module with its own dependencies.
- Late binding -- format selection happens at runtime based on configuration, not at compile time.
Reading: BulkReaderFormatFactory
BulkReaderFormatFactory extends the Table API's DecodingFormatFactory parameterized with BulkFormat<RowData, FileSourceSplit>. Its sole contract is to produce a BulkDecodingFormat<RowData> from a table factory context and format-specific options:
public interface BulkReaderFormatFactory
extends DecodingFormatFactory<BulkFormat<RowData, FileSourceSplit>> {
@Override
BulkDecodingFormat<RowData> createDecodingFormat(
DynamicTableFactory.Context context, ReadableConfig formatOptions);
}
The returned BulkDecodingFormat acts as a bridge to the lower-level BulkFormat reader, adding optional filter pushdown support:
public interface BulkDecodingFormat<T>
extends DecodingFormat<BulkFormat<T, FileSourceSplit>> {
default void applyFilters(List<ResolvedExpression> filters) {}
}
This allows format implementations to accept predicate expressions from the query optimizer and apply them during file scanning for improved I/O efficiency (for example, Parquet column and row group pruning).
Writing: BulkWriterFormatFactory
BulkWriterFormatFactory mirrors the reader side by extending EncodingFormatFactory parameterized with BulkWriter.Factory<RowData>. The interface is intentionally minimal -- it inherits all behavior from its parent and adds no additional methods:
public interface BulkWriterFormatFactory
extends EncodingFormatFactory<BulkWriter.Factory<RowData>> {
// fully specified by generics
}
This symmetry between reader and writer factories ensures that format authors provide both halves of the format lifecycle through the same discovery mechanism.
Factory Discovery Flow
The end-to-end discovery sequence is:
| Step | Actor | Action |
|---|---|---|
| 1 | User | Declares FORMAT = 'parquet' in DDL.
|
| 2 | FactoryUtil |
Calls ServiceLoader.load(BulkReaderFormatFactory.class) to find all registered reader factories.
|
| 3 | FactoryUtil |
Matches factory identifier against user-specified format name. |
| 4 | Factory | createDecodingFormat() returns a BulkDecodingFormat bound to format-specific options.
|
| 5 | Table Source | Calls createRuntimeDecoder() on the format to obtain a BulkFormat reader.
|