
Workflow:Apache Flink File Source Pipeline

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Stream_Processing, File_IO
Last Updated 2026-02-09 13:00 GMT

Overview

End-to-end process for reading files into a Flink pipeline using the FLIP-27 FileSource connector with split-based parallelism, format readers, and optional continuous monitoring.

Description

This workflow describes the standard procedure for ingesting file-based data into a Flink pipeline using the modern FileSource API built on the FLIP-27 source framework. The source enumerates input files, splits them for parallel processing, and reads records using configurable format readers (stream-based for line-delimited formats, bulk-based for columnar formats). It supports both bounded batch mode (read files once) and continuous streaming mode (monitor directories for new files). Split assignment can be locality-aware, preferring readers that run close to a split's file blocks on distributed filesystems.

Key capabilities:

  • Supports both bounded (batch) and continuous (streaming) modes
  • Stream format readers for line-delimited text, CSV
  • Bulk format readers for columnar formats (Parquet, ORC)
  • Locality-aware split assignment that places splits near their file blocks
  • File filtering and directory recursion
  • Checkpoint-based split state persistence
  • Decompression support for compressed files

Usage

Execute this workflow when you need to read data from files on a local filesystem, HDFS, or S3-compatible storage into a Flink DataStream or Table pipeline. Use bounded mode for one-time batch processing of existing files, or continuous mode for monitoring directories and processing new files as they appear.

Execution Steps

Step 1: Configure File Source Builder

Initialize the FileSource builder with the input path(s) and the appropriate format reader. Choose between stream format (for record-by-record reading of text-based files) or bulk format (for batch reading of columnar files). The builder accepts one or more input paths to monitor.

Key considerations:

  • Use forRecordStreamFormat for line-delimited text files (e.g., TextLineInputFormat)
  • Use forBulkFileFormat for columnar files (e.g., Parquet, ORC)
  • Multiple input paths can be specified for reading from several directories
  • The format determines how file bytes are deserialized into records
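As a sketch of this step (the input paths are placeholders, not prescribed locations), a bounded text-line source over two directories might be built like this:

```java
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

public class FileSourceSetup {

    public static FileSource<String> boundedTextSource() {
        // Stream format: TextLineInputFormat reads records line by line.
        // forRecordStreamFormat accepts one or more input paths.
        return FileSource
                .forRecordStreamFormat(
                        new TextLineInputFormat(),
                        new Path("/data/input"),
                        new Path("/data/extra"))
                .build();
    }
}
```

For columnar files the builder entry point is `forBulkFileFormat` with a BulkFormat implementation (e.g. the Parquet reader from the flink-parquet module) instead of a StreamFormat.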

Step 2: Configure Enumeration and Splitting

Set up how input files are discovered and divided into processing splits. The file enumerator recursively scans input directories, applies file filters, and creates splits. For splittable formats, large files are divided into block-aligned splits for parallel processing. Non-splittable formats create one split per file.

Key considerations:

  • BlockSplittingRecursiveEnumerator creates sub-file splits aligned to filesystem block boundaries
  • NonSplittingRecursiveEnumerator creates whole-file splits
  • DefaultFileFilter excludes hidden files (starting with . or _)
  • RegexFileFilter supports custom file name filtering patterns
  • File enumeration runs in the SplitEnumerator on the JobManager
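A hedged sketch of custom enumeration: the `.log` suffix filter below is an illustrative assumption about a naming convention, not a Flink default (the default enumerator already applies DefaultFileFilter, which skips names starting with `.` or `_`):

```java
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.enumerate.NonSplittingRecursiveEnumerator;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

public class EnumeratorSetup {

    public static FileSource<String> filteredSource() {
        return FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("/data/input"))
                // Whole-file splits (non-splittable), keeping only *.log files.
                // The provider lambda creates a fresh enumerator per job start.
                .setFileEnumerator(
                        () -> new NonSplittingRecursiveEnumerator(
                                path -> path.getName().endsWith(".log")))
                .build();
    }
}
```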

Step 3: Assign Splits to Readers

The split enumerator distributes discovered splits to parallel source readers. In bounded mode, a StaticFileSplitEnumerator assigns all splits once. In continuous mode, a ContinuousFileSplitEnumerator periodically re-enumerates and assigns new splits. The locality-aware assigner matches splits to readers based on where the underlying file blocks reside on distributed filesystems.

Key considerations:

  • LocalityAwareSplitAssigner considers file block locations when assigning splits
  • SimpleSplitAssigner uses round-robin distribution without locality awareness
  • In continuous mode, the discovery interval controls re-enumeration frequency
  • Splits can be redistributed on reader rescaling
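Sketching this step (the 30-second discovery interval is an illustrative choice, and the locality-aware assigner shown explicitly here is already the default), a continuous source could be configured as:

```java
import java.time.Duration;

import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.assigners.LocalityAwareSplitAssigner;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

public class AssignerSetup {

    public static FileSource<String> continuousSource() {
        return FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("/data/input"))
                // Locality-aware assignment considers file block locations;
                // use SimpleSplitAssigner::new for plain round-robin instead.
                .setSplitAssigner(LocalityAwareSplitAssigner::new)
                // Continuous mode: re-enumerate the directory every 30 seconds.
                .monitorContinuously(Duration.ofSeconds(30))
                .build();
    }
}
```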

Step 4: Read Records from Splits

Each parallel FileSourceReader processes its assigned splits through the FileSourceSplitReader. The split reader opens files, seeks to the split offset, and reads records using the configured format reader. Records are batched in FileRecords containers and passed through the RecordEmitter to the output.

What happens:

  • FileSourceSplitReader opens the file at the split's start offset
  • The format reader (StreamFormat or BulkFormat) deserializes bytes into records
  • Records are wrapped in RecordAndPosition containers with checkpoint position metadata
  • FileSourceRecordEmitter passes records to the output collector
  • Split state (current offset, records read) is tracked for checkpointing
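Wiring the source into a pipeline triggers the reading machinery above; a minimal sketch (path and job name are placeholders):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReadPipeline {

    public static DataStream<String> buildPipeline(StreamExecutionEnvironment env) {
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("/data/input"))
                .build();
        // Parallel FileSourceReaders pull their assigned splits and emit
        // deserialized records into the stream.
        return env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        buildPipeline(env).print();
        env.execute("file-source-pipeline");
    }
}
```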

Step 5: Handle Checkpoints and Recovery

On each checkpoint, the source persists split assignment state and per-reader progress. The PendingSplitsCheckpoint captures which splits have been assigned, which are pending, and optionally the enumerator's file discovery state. On recovery, splits are restored and reading resumes from the checkpointed position.

Key considerations:

  • Split state includes the current file offset and records-to-skip count
  • In continuous mode, already-processed file paths are tracked to avoid re-reading
  • PendingSplitsCheckpointSerializer handles versioned state serialization
  • Recovery re-assigns unfinished splits to available readers
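Checkpointing must be enabled on the execution environment for split state to be persisted at all; a minimal sketch (the 60-second interval is an illustrative assumption, not a recommended default):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {

    public static StreamExecutionEnvironment checkpointedEnv() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Every 60 s, split assignment state and per-reader file offsets are
        // snapshotted; on failure, reading resumes from the last checkpoint.
        env.enableCheckpointing(60_000L);
        return env;
    }
}
```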

Execution Diagram
