Workflow: Apache Flink File Source Pipeline
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Engineering, Stream_Processing, File_IO |
| Last Updated | 2026-02-09 13:00 GMT |
Overview
End-to-end process for reading files into a Flink pipeline using the FLIP-27 FileSource connector with split-based parallelism, format readers, and optional continuous monitoring.
Description
This workflow describes the standard procedure for ingesting file-based data into a Flink pipeline using the modern FileSource API built on the FLIP-27 source framework. The source enumerates input files, splits them for parallel processing, and reads records using configurable format readers (stream-based for line-delimited formats, bulk-based for columnar formats). It supports both bounded batch mode (read files once) and continuous streaming mode (monitor directories for new files). Split assignment can be locality-aware, preferring readers that run on the hosts storing a split's file blocks on distributed filesystems such as HDFS.
Key capabilities:
- Supports both bounded (batch) and continuous (streaming) modes
- Stream format readers for line-delimited text, CSV
- Bulk format readers for columnar formats (Parquet, ORC)
- Locality-aware split assignment on distributed filesystems
- File filtering and directory recursion
- Checkpoint-based split state persistence
- Decompression support for compressed files
Usage
Execute this workflow when you need to read data from files on a local filesystem, HDFS, or S3-compatible storage into a Flink DataStream or Table pipeline. Use bounded mode for one-time batch processing of existing files, or continuous mode for monitoring directories and processing new files as they appear.
Execution Steps
Step 1: Configure File Source Builder
Initialize the FileSource builder with the input path(s) and the appropriate format reader. Choose between stream format (for record-by-record reading of text-based files) or bulk format (for batch reading of columnar files). The builder accepts one or more input paths to monitor.
Key considerations:
- Use forRecordStreamFormat for line-delimited text files (e.g., TextLineInputFormat)
- Use forBulkFileFormat for columnar files (e.g., Parquet, ORC)
- Multiple input paths can be specified for reading from several directories
- The format determines how file bytes are deserialized into records
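A minimal builder sketch in Java, assuming Flink 1.15+ with `flink-connector-files` (and the DataStream API) on the classpath; the input path and monitoring interval are illustrative, not prescribed values:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FileSourcePipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stream format: one record per line of text (path is illustrative).
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("hdfs:///data/input"))
                .monitorContinuously(Duration.ofSeconds(30))  // continuous mode; omit for bounded
                .build();

        DataStream<String> lines =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

        lines.print();
        env.execute("file-source-pipeline");
    }
}
```

Dropping the `monitorContinuously(...)` call yields a bounded source that reads the existing file set once; the builder also exposes `processStaticFileSet()` to make bounded mode explicit.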
Step 2: Configure Enumeration and Splitting
Set up how input files are discovered and divided into processing splits. The file enumerator recursively scans input directories, applies file filters, and creates splits. For splittable formats, large files are divided into block-aligned splits for parallel processing. Non-splittable formats create one split per file.
Key considerations:
- BlockSplittingRecursiveEnumerator creates sub-file splits aligned to filesystem block boundaries
- NonSplittingRecursiveEnumerator creates whole-file splits
- DefaultFileFilter excludes hidden files (starting with . or _)
- RegexFileFilter supports custom file name filtering patterns
- File enumeration runs in the SplitEnumerator on the JobManager
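The block-aligned splitting can be illustrated with a small self-contained sketch (not the actual BlockSplittingRecursiveEnumerator code): a file of length L on a filesystem with block size B yields one split per block, each starting at a block boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of block-aligned splitting (not Flink's enumerator code):
// a splittable file is cut at filesystem block boundaries so each split can be
// processed by a different parallel reader, ideally on a node holding that block.
final class BlockAlignedSplitter {
    record Split(long offset, long length) {}

    static List<Split> split(long fileLength, long blockSize) {
        List<Split> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            splits.add(new Split(offset, Math.min(blockSize, fileLength - offset)));
        }
        return splits;
    }
}
```

A non-splittable format would instead map the whole file to a single split of length `fileLength`, which is what the NonSplittingRecursiveEnumerator does.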
Step 3: Assign Splits to Readers
The split enumerator distributes discovered splits to parallel source readers. In bounded mode, a StaticFileSplitEnumerator assigns all splits once. In continuous mode, a ContinuousFileSplitEnumerator periodically re-enumerates and assigns new splits. The locality-aware assigner optimizes for data locality on distributed filesystems.
Key considerations:
- LocalityAwareSplitAssigner considers file block locations when assigning splits
- SimpleSplitAssigner uses round-robin distribution without locality awareness
- In continuous mode, the discovery interval controls re-enumeration frequency
- Splits can be redistributed on reader rescaling
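The assignment strategy can be sketched in a few lines of plain Java (this mirrors the idea behind LocalityAwareSplitAssigner, not its actual code): prefer a reader on a host that stores one of the split's file blocks; otherwise fall back to round-robin, as SimpleSplitAssigner effectively does for every split.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of locality-aware split assignment (not Flink's code):
// a split carries the hostnames of its file blocks; if a reader runs on one
// of those hosts, assign locally, otherwise distribute round-robin.
final class LocalityAwareSketch {
    private int nextRoundRobin = 0;

    // readerByHost maps hostname -> reader subtask index.
    int assign(List<String> splitBlockHosts, Map<String, Integer> readerByHost, int parallelism) {
        for (String host : splitBlockHosts) {
            Integer reader = readerByHost.get(host);
            if (reader != null) {
                return reader;                      // local read: block and reader share a host
            }
        }
        return nextRoundRobin++ % parallelism;      // remote read: spread evenly
    }
}
```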
Step 4: Read Records from Splits
Each parallel FileSourceReader processes its assigned splits through the FileSourceSplitReader. The split reader opens files, seeks to the split offset, and reads records using the configured format reader. Records are batched in FileRecords containers and passed through the RecordEmitter to the output.
What happens:
- FileSourceSplitReader opens the file at the split's start offset
- The format reader (StreamFormat or BulkFormat) deserializes bytes into records
- Records are wrapped in RecordAndPosition containers with checkpoint position metadata
- FileSourceRecordEmitter passes records to the output collector
- Split state (current offset, records read) is tracked for checkpointing
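The seek-and-read loop above can be sketched with a self-contained reader over an in-memory "file" (this illustrates the mechanism, not Flink's actual FileSourceSplitReader): open at the split's start offset, emit newline-delimited records, and track the byte position after each record so a checkpoint can capture it.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of reading records from a split (not Flink's code):
// seek to the split's start offset, then emit newline-delimited records,
// advancing a byte position that checkpoints can persist.
final class SplitRecordReader {
    private final byte[] file;
    private long position;                 // current byte offset, advanced per record

    SplitRecordReader(byte[] file, long splitOffset) {
        this.file = file;
        this.position = splitOffset;       // "seek" to the split's start offset
    }

    String nextRecord() {                  // returns null at end of data
        if (position >= file.length) return null;
        int end = (int) position;
        while (end < file.length && file[end] != '\n') end++;
        String record = new String(file, (int) position, end - (int) position,
                                   StandardCharsets.UTF_8);
        position = end + 1;                // position after the delimiter
        return record;
    }

    long position() { return position; }
}
```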
Step 5: Handle Checkpoints and Recovery
On each checkpoint, the source persists split assignment state and per-reader progress. The PendingSplitsCheckpoint captures which splits have been assigned, which are pending, and optionally the enumerator's file discovery state. On recovery, splits are restored and reading resumes from the checkpointed position.
Key considerations:
- Split state includes the current file offset and records-to-skip count
- In continuous mode, already-processed file paths are tracked to avoid re-reading
- PendingSplitsCheckpointSerializer handles versioned state serialization
- Recovery re-assigns unfinished splits to available readers
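The offset-plus-skip-count recovery can be sketched as follows (this mirrors the idea of Flink's checkpointed position, not the actual class): on restore, seek to the checkpointed offset, then discard the records that were already emitted before the checkpoint so downstream sees each record exactly once.

```java
import java.util.List;

// Illustrative sketch of split recovery (not Flink's code): a checkpointed
// position is an offset plus a count of records already emitted past that
// offset; resuming means re-reading from the offset and skipping that many.
final class SplitRecovery {
    record CheckpointedPosition(long offset, long recordsToSkip) {}

    // recordsFromOffset: the split's records, in order, starting at pos.offset().
    static List<String> resume(List<String> recordsFromOffset, CheckpointedPosition pos) {
        // Skip records already processed before the checkpoint.
        return recordsFromOffset.subList((int) pos.recordsToSkip(), recordsFromOffset.size());
    }
}
```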