Workflow:Apache Flink Hybrid Source Switching
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Stream_Processing, Source_Orchestration |
| Last Updated | 2026-02-09 13:00 GMT |
Overview
End-to-end process for composing multiple Flink sources into a sequential chain using HybridSource, enabling seamless transitions from bounded historical data to unbounded real-time streams.
Description
This workflow describes how to use Flink's HybridSource to chain multiple data sources that execute sequentially. The primary use case is backfilling: first reading historical data from a bounded source (e.g., files, database snapshot), then seamlessly switching to an unbounded source (e.g., Kafka, Kinesis) for real-time processing. The HybridSource coordinates the lifecycle of underlying sources, managing reader state transitions and split enumeration across source boundaries. Each source except the last must be bounded, and the switch context allows passing positional information between sources for gap-free transitions.
Key capabilities:
- Sequential chaining of two or more sources
- Seamless bounded-to-unbounded transitions
- SourceSwitchContext for position-based start offset derivation
- Automatic reader lifecycle management across source boundaries
- Checkpoint and recovery support across source switches
- Split wrapping for unified state management
Usage
Execute this workflow when you need to process historical backfill data followed by real-time streaming data in a single Flink job, without requiring separate batch and streaming jobs. Common scenarios include: reading from file storage then switching to a message queue, reading from a database snapshot then switching to a change data capture stream, or chaining multiple bounded sources for staged data ingestion.
Execution Steps
Step 1: Define Source Chain
Configure the HybridSource using the HybridSourceBuilder to specify the ordered sequence of underlying sources. Each source is registered with its factory and boundedness declaration. The first source is added directly, while subsequent sources can use a SourceFactory that receives the previous source's enumerator state for dynamic start position derivation.
Key considerations:
- All sources except the last must be bounded
- The last source can be either bounded or unbounded
- SourceFactory enables lazy source construction with context from the previous source
- SourceSwitchContext provides access to the previous enumerator for position derivation
- Sources are indexed internally starting from 0
Step 2: Initialize First Source
When the HybridSource starts, it creates the first underlying source's SplitEnumerator and SourceReader. The HybridSourceSplitEnumerator wraps the first source's enumerator, intercepting split assignments and reader registrations. The HybridSourceReader wraps the underlying reader and tracks the current source index.
What happens:
- The HybridSourceSplitEnumerator creates the first source's enumerator
- All parallel HybridSourceReaders are initialized with the first source's reader
- Splits from the first source are wrapped in HybridSourceSplit containers
- The underlying reader begins processing its assigned splits
Step 3: Process Current Source Until Completion
The current source's reader processes its splits normally. As each reader finishes its assigned splits, it sends a SourceReaderFinishedEvent to the enumerator. The enumerator tracks which readers have completed the current source and waits for all readers to finish before initiating the switch.
Key considerations:
- Readers process splits through the standard FLIP-27 source reader pipeline
- SourceReaderFinishedEvent signals reader-level completion
- The enumerator uses SupportsIntermediateNoMoreSplits to signal per-source end
- All readers must finish the current source before switching occurs
- Pending splits from the current source are preserved for recovery
Step 4: Switch to Next Source
When all readers have finished the current source, the HybridSourceSplitEnumerator closes the current source's enumerator and creates the next one. If a SourceFactory was provided, it receives the SourceSwitchContext containing the previous enumerator for start position derivation. A SwitchSourceEvent is sent to all readers to transition them to the new source.
What happens:
- The current source's enumerator is closed
- The next source is created (directly or via SourceFactory with switch context)
- The new source's enumerator is initialized and started
- SwitchSourceEvent (containing the new source index) is broadcast to all readers
- Each HybridSourceReader creates a new underlying reader for the next source
- Split assignment begins for the new source
Step 5: Handle Checkpoints Across Source Boundaries
The HybridSource maintains consistent checkpoint state across source switches. The HybridSourceEnumeratorState captures the current source index, the underlying enumerator's state, and any pending splits from previous sources. On recovery, the source chain is replayed up to the checkpointed source index, and processing resumes from the correct position.
Key considerations:
- HybridSourceEnumeratorState wraps the underlying enumerator state with source index
- HybridSourceSplit wraps underlying splits with source index for correct deserialization
- Recovery replays source creation up to the checkpointed index
- Pending splits from previous sources are preserved for subtask reset scenarios
- SwitchedSources maintains a registry of source factories for recovery