Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Apache Flink File Source Object Pooling and Record Wrapping

From Leeroopedia


Knowledge Sources
Domains File_Source, Object_Pooling, Record_Wrapping
Last Updated 2026-02-09 00:00 GMT

Overview

Description

The File Source Utilities are a set of common utility classes that support the file source connector's internal record processing pipeline. These utilities address three recurring concerns in high-throughput stream processing: object pooling to reduce garbage collection pressure, record wrappers that pair data records with checkpoint-compatible position metadata, and singleton result patterns for formats that emit one record at a time.

The key classes in this utility layer are:

  • Pool -- a bounded, thread-safe object pool backed by an ArrayBlockingQueue, designed to recycle heavyweight objects (such as column batches or row containers) between I/O threads and Flink's main processing threads.
  • RecordAndPosition -- an immutable value object that pairs a data record of generic type E with a two-part checkpoint position (byte offset and record skip count).
  • SingletonResultIterator -- a BulkFormat.RecordIterator optimized for the common case where a format produces exactly one record per read call, avoiding the overhead of array or collection wrappers.

Theoretical Basis

Object Pooling for Asynchronous Pipelines

In Flink's file source architecture, reading and record processing happen on different threads. A BulkFormat.Reader running on an I/O thread produces record batches, while Flink's main task thread consumes them. Because the reader cannot reuse a buffer until the processing side has finished with it, a producer-consumer pool is needed.

The Pool class implements this pattern with a fixed-capacity ArrayBlockingQueue. Objects are added once during initialization, then repeatedly taken (via pollEntry()) and returned (via the Recycler callback) throughout the source's lifetime. The blocking pollEntry() method naturally applies backpressure: if all pooled objects are in use, the reader thread blocks until one is recycled.

// Pool uses ArrayBlockingQueue for thread-safe bounded recycling
public class Pool<T> {
    private final ArrayBlockingQueue<T> pool;

    public T pollEntry() throws InterruptedException {
        return pool.take();  // blocks until an object is recycled back
    }
}

The Recycler interface is a single-method functional interface, enabling concise lambda-based recycling callbacks:

@FunctionalInterface
public interface Recycler<T> {
    void recycle(T object);
}

Position-Aware Record Wrappers

The RecordAndPosition class embodies the principle that every record emitted from a file source must carry enough information to reconstruct the reader's state during a checkpoint. It stores three fields:

Field Type Description
record E The actual data record.
offset long Byte offset (or NO_OFFSET for sequential formats).
recordSkipCount long Number of records to skip after the offset to reach the position after this record.

The class is intentionally designed with package-private, non-final fields so that its mutable subclass (MutableRecordAndPosition) can update them in-place without allocating new objects. This design trades strict immutability for allocation-free iteration in tight loops.

Singleton Result Pattern

The SingletonResultIterator addresses the case where a format yields one record per invocation (for example, a line-oriented text format). Rather than wrapping a single element in an array or list, the iterator holds exactly one MutableRecordAndPosition and returns it on the first call to next(), returning null on all subsequent calls. Like the array-backed iterator, it extends RecyclableIterator to support pool-based resource management via an optional recycler callback on releaseBatch().

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment