Workflow: Apache Flink File Source Pipeline
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Engineering, Stream_Processing, File_IO |
| Last Updated | 2026-02-09 13:00 GMT |
Overview
End-to-end process for reading files into a Flink pipeline using the FLIP-27 FileSource connector with split-based parallelism, format readers, and optional continuous monitoring.
Description
This workflow describes the standard procedure for ingesting file-based data into a Flink pipeline using the modern FileSource API built on the FLIP-27 source framework. The source enumerates input files, splits them for parallel processing, and reads records using configurable format readers (stream-based for line-delimited formats, bulk-based for columnar formats). It supports both bounded batch mode (read files once) and continuous streaming mode (monitor directories for new files). Split assignment can be locality-aware, preferring readers that run on the hosts storing a split's file blocks on distributed filesystems such as HDFS.
Key capabilities:
- Supports both bounded (batch) and continuous (streaming) modes
- Stream format readers for line-delimited text, CSV
- Bulk format readers for columnar formats (Parquet, ORC)
- Locality-aware split assignment on distributed filesystems
- File filtering and directory recursion
- Checkpoint-based split state persistence
- Decompression support for compressed files
Usage
Execute this workflow when you need to read data from files on a local filesystem, HDFS, or S3-compatible storage into a Flink DataStream or Table pipeline. Use bounded mode for one-time batch processing of existing files, or continuous mode for monitoring directories and processing new files as they appear.
Execution Steps
Step 1: Configure File Source Builder
Initialize the FileSource builder with the input path(s) and the appropriate format reader. Choose between stream format (for record-by-record reading of text-based files) or bulk format (for batch reading of columnar files). The builder accepts one or more input paths to monitor.
Key considerations:
- Use forRecordStreamFormat for line-delimited text files (e.g., TextLineInputFormat)
- Use forBulkFileFormat for columnar files (e.g., Parquet, ORC)
- Multiple input paths can be specified for reading from several directories
- The format determines how file bytes are deserialized into records
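A minimal builder sketch in Java, assuming Flink 1.15+ with `flink-connector-files` (and the DataStream API) on the classpath; the input path and monitoring interval are illustrative, not prescribed values:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FileSourcePipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stream format: one record per line of text (path is illustrative).
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("hdfs:///data/input"))
                .monitorContinuously(Duration.ofSeconds(30))  // continuous mode; omit for bounded
                .build();

        DataStream<String> lines =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

        lines.print();
        env.execute("file-source-pipeline");
    }
}
```

Dropping the `monitorContinuously(...)` call yields a bounded source that reads the existing file set once; the builder also exposes `processStaticFileSet()` to make bounded mode explicit.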
Step 2: Configure Enumeration and Splitting
Set up how input files are discovered and divided into processing splits. The file enumerator recursively scans input directories, applies file filters, and creates splits. For splittable formats, large files are divided into block-aligned splits for parallel processing. Non-splittable formats create one split per file.
Key considerations:
- BlockSplittingRecursiveEnumerator creates sub-file splits aligned to filesystem block boundaries
- NonSplittingRecursiveEnumerator creates whole-file splits
- DefaultFileFilter excludes hidden files (starting with . or _)
- RegexFileFilter supports custom file name filtering patterns
- File enumeration runs in the SplitEnumerator on the JobManager
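The block-aligned splitting can be illustrated with a small self-contained sketch (not the actual BlockSplittingRecursiveEnumerator code): a file of length L on a filesystem with block size B yields one split per block, each starting at a block boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of block-aligned splitting (not Flink's enumerator code):
// a splittable file is cut at filesystem block boundaries so each split can be
// processed by a different parallel reader, ideally on a node holding that block.
final class BlockAlignedSplitter {
    record Split(long offset, long length) {}

    static List<Split> split(long fileLength, long blockSize) {
        List<Split> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            splits.add(new Split(offset, Math.min(blockSize, fileLength - offset)));
        }
        return splits;
    }
}
```

A non-splittable format would instead map the whole file to a single split of length `fileLength`, which is what the NonSplittingRecursiveEnumerator does.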
Step 3: Assign Splits to Readers
The split enumerator distributes discovered splits to parallel source readers. In bounded mode, a StaticFileSplitEnumerator assigns all splits once. In continuous mode, a ContinuousFileSplitEnumerator periodically re-enumerates and assigns new splits. The locality-aware assigner optimizes for data locality on distributed filesystems.
Key considerations:
- LocalityAwareSplitAssigner considers file block locations when assigning splits
- SimpleSplitAssigner uses round-robin distribution without locality awareness
- In continuous mode, the discovery interval controls re-enumeration frequency
- Splits can be redistributed on reader rescaling
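The assignment strategy can be sketched in a few lines of plain Java (this mirrors the idea behind LocalityAwareSplitAssigner, not its actual code): prefer a reader on a host that stores one of the split's file blocks; otherwise fall back to round-robin, as SimpleSplitAssigner effectively does for every split.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of locality-aware split assignment (not Flink's code):
// a split carries the hostnames of its file blocks; if a reader runs on one
// of those hosts, assign locally, otherwise distribute round-robin.
final class LocalityAwareSketch {
    private int nextRoundRobin = 0;

    // readerByHost maps hostname -> reader subtask index.
    int assign(List<String> splitBlockHosts, Map<String, Integer> readerByHost, int parallelism) {
        for (String host : splitBlockHosts) {
            Integer reader = readerByHost.get(host);
            if (reader != null) {
                return reader;                      // local read: block and reader share a host
            }
        }
        return nextRoundRobin++ % parallelism;      // remote read: spread evenly
    }
}
```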
Step 4: Read Records from Splits
Each parallel FileSourceReader processes its assigned splits through the FileSourceSplitReader. The split reader opens files, seeks to the split offset, and reads records using the configured format reader. Records are batched in FileRecords containers and passed through the RecordEmitter to the output.
What happens:
- FileSourceSplitReader opens the file at the split's start offset
- The format reader (StreamFormat or BulkFormat) deserializes bytes into records
- Records are wrapped in RecordAndPosition containers with checkpoint position metadata
- FileSourceRecordEmitter passes records to the output collector
- Split state (current offset, records read) is tracked for checkpointing
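The seek-and-read loop above can be sketched with a self-contained reader over an in-memory "file" (this illustrates the mechanism, not Flink's actual FileSourceSplitReader): open at the split's start offset, emit newline-delimited records, and track the byte position after each record so a checkpoint can capture it.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of reading records from a split (not Flink's code):
// seek to the split's start offset, then emit newline-delimited records,
// advancing a byte position that checkpoints can persist.
final class SplitRecordReader {
    private final byte[] file;
    private long position;                 // current byte offset, advanced per record

    SplitRecordReader(byte[] file, long splitOffset) {
        this.file = file;
        this.position = splitOffset;       // "seek" to the split's start offset
    }

    String nextRecord() {                  // returns null at end of data
        if (position >= file.length) return null;
        int end = (int) position;
        while (end < file.length && file[end] != '\n') end++;
        String record = new String(file, (int) position, end - (int) position,
                                   StandardCharsets.UTF_8);
        position = end + 1;                // position after the delimiter
        return record;
    }

    long position() { return position; }
}
```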
Step 5: Handle Checkpoints and Recovery
On each checkpoint, the source persists split assignment state and per-reader progress. The PendingSplitsCheckpoint captures which splits have been assigned, which are pending, and optionally the enumerator's file discovery state. On recovery, splits are restored and reading resumes from the checkpointed position.
Key considerations:
- Split state includes the current file offset and records-to-skip count
- In continuous mode, already-processed file paths are tracked to avoid re-reading
- PendingSplitsCheckpointSerializer handles versioned state serialization
- Recovery re-assigns unfinished splits to available readers
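The offset-plus-skip-count recovery can be sketched as follows (this mirrors the idea of Flink's checkpointed position, not the actual class): on restore, seek to the checkpointed offset, then discard the records that were already emitted before the checkpoint so downstream sees each record exactly once.

```java
import java.util.List;

// Illustrative sketch of split recovery (not Flink's code): a checkpointed
// position is an offset plus a count of records already emitted past that
// offset; resuming means re-reading from the offset and skipping that many.
final class SplitRecovery {
    record CheckpointedPosition(long offset, long recordsToSkip) {}

    // recordsFromOffset: the split's records, in order, starting at pos.offset().
    static List<String> resume(List<String> recordsFromOffset, CheckpointedPosition pos) {
        // Skip records already processed before the checkpoint.
        return recordsFromOffset.subList((int) pos.recordsToSkip(), recordsFromOffset.size());
    }
}
```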