Workflow:Apache Flink Hybrid Source Switching

Knowledge Sources	Apache Flink, Flink HybridSource Docs
Domains	Data_Engineering, Stream_Processing, Source_Orchestration
Last Updated	2026-02-09 13:00 GMT

Overview

End-to-end process for composing multiple Flink sources into a sequential chain using HybridSource, enabling seamless transitions from bounded historical data to unbounded real-time streams.

Description

This workflow describes how to use Flink's HybridSource to chain multiple data sources that execute sequentially. The primary use case is backfilling: the job first reads historical data from a bounded source (e.g., files or a database snapshot), then switches to an unbounded source (e.g., Kafka or Kinesis) for real-time processing. HybridSource coordinates the lifecycle of the underlying sources, managing reader state transitions and split enumeration across source boundaries. Every source except the last must be bounded, because a source that never finishes can never hand over to its successor. The SourceSwitchContext lets the next source derive its start position from the previous source's enumerator, so the transition leaves no gap in the data.

Key capabilities:

Sequential chaining of two or more sources
Seamless bounded-to-unbounded transitions
SourceSwitchContext for position-based start offset derivation
Automatic reader lifecycle management across source boundaries
Checkpoint and recovery support across source switches
Split wrapping for unified state management

Usage

Execute this workflow when you need to process historical backfill data followed by real-time streaming data in a single Flink job, without requiring separate batch and streaming jobs. Common scenarios include: reading from file storage then switching to a message queue, reading from a database snapshot then switching to a change data capture stream, or chaining multiple bounded sources for staged data ingestion.

Execution Steps

Step 1: Define Source Chain

Configure the HybridSource with the HybridSourceBuilder to specify the ordered sequence of underlying sources. The first source is passed to the builder directly. Each subsequent source is added either as a ready-made source instance or as a SourceFactory with an explicit boundedness declaration. The factory is invoked at switch time and receives a SourceSwitchContext that exposes the previous source's enumerator (not its checkpointed state), so the start position can be derived dynamically.

Key considerations:

All sources except the last must be bounded
The last source can be either bounded or unbounded
SourceFactory enables lazy source construction with context from the previous source
SourceSwitchContext provides access to the previous enumerator for position derivation
Sources are indexed internally starting from 0
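
The ordering rules above can be illustrated with a small plain-Java model. This is a sketch under stated assumptions, not Flink's builder: SourceChainSketch and its methods are hypothetical names, and sources are reduced to a name plus a boundedness flag. In Flink itself the chain is assembled with HybridSource.builder(firstSource).addSource(nextSource).build().

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for illustration; these are not Flink's own classes.
final class SourceChainSketch {
    enum Boundedness { BOUNDED, CONTINUOUS_UNBOUNDED }

    private final List<String> names = new ArrayList<>();
    private final List<Boundedness> boundedness = new ArrayList<>();

    // Appends a source. The call is rejected if the current tail is unbounded,
    // because an unbounded source never finishes and cannot have a successor.
    SourceChainSketch add(String name, Boundedness b) {
        if (!boundedness.isEmpty()
                && boundedness.get(boundedness.size() - 1) != Boundedness.BOUNDED) {
            throw new IllegalArgumentException(
                    "All sources except the last must be bounded");
        }
        names.add(name);
        boundedness.add(b);
        return this;
    }

    int size() {
        return names.size();
    }

    // Sources are addressed by a zero-based index.
    String nameAt(int sourceIndex) {
        return names.get(sourceIndex);
    }

    // The chain as a whole is as bounded as its last source.
    Boundedness overallBoundedness() {
        return boundedness.get(boundedness.size() - 1);
    }

    public static void main(String[] args) {
        SourceChainSketch chain = new SourceChainSketch()
                .add("historical-files", Boundedness.BOUNDED)
                .add("live-topic", Boundedness.CONTINUOUS_UNBOUNDED);
        System.out.println(chain.nameAt(0) + " then " + chain.nameAt(1)
                + ", overall " + chain.overallBoundedness());
    }
}
```

Validating at add time means a misordered chain fails when the job is defined rather than when the first switch is attempted.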

Step 2: Initialize First Source

When the HybridSource starts, it creates the first underlying source's SplitEnumerator and SourceReader. The HybridSourceSplitEnumerator wraps the first source's enumerator, intercepting split assignments and reader registrations. The HybridSourceReader wraps the underlying reader and tracks the current source index.

What happens:

The HybridSourceSplitEnumerator creates the first source's enumerator
All parallel HybridSourceReaders are initialized with the first source's reader
Splits from the first source are wrapped in HybridSourceSplit containers
The underlying reader begins processing its assigned splits
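
As a rough illustration of the wrapping described above, the following plain-Java sketch models a reader that delegates to one underlying reader at a time and remembers which source it is on. DelegatingReaderSketch is a hypothetical name, and each underlying reader is reduced to an iterator over an in-memory list, which is an assumption made only to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified model of a wrapping reader; not Flink's HybridSourceReader.
final class DelegatingReaderSketch {
    private final List<List<String>> recordsPerSource;
    private int currentSourceIndex;
    private Iterator<String> currentReader;

    DelegatingReaderSketch(List<List<String>> recordsPerSource) {
        this.recordsPerSource = recordsPerSource;
        // Start on the first source (index 0).
        this.currentSourceIndex = 0;
        this.currentReader = recordsPerSource.get(0).iterator();
    }

    // Returns the next record of the current source, or null once it is exhausted.
    String pollNext() {
        return currentReader.hasNext() ? currentReader.next() : null;
    }

    // Replaces the underlying reader with a fresh one for the given source.
    void onSwitch(int newSourceIndex) {
        currentSourceIndex = newSourceIndex;
        currentReader = recordsPerSource.get(newSourceIndex).iterator();
    }

    int currentSourceIndex() {
        return currentSourceIndex;
    }

    public static void main(String[] args) {
        DelegatingReaderSketch reader = new DelegatingReaderSketch(
                Arrays.asList(Arrays.asList("h1", "h2"), Arrays.asList("live1")));
        System.out.println(reader.pollNext()); // record from source 0
        reader.onSwitch(1);
        System.out.println(reader.pollNext()); // record from source 1
    }
}
```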

Step 3: Process Current Source Until Completion

The current source's reader processes its splits normally. As each reader finishes its assigned splits, it sends a SourceReaderFinishedEvent to the enumerator. The enumerator tracks which readers have completed the current source and waits for all readers to finish before initiating the switch.

Key considerations:

Readers process splits through the standard FLIP-27 source reader pipeline
SourceReaderFinishedEvent signals reader-level completion
The enumerator uses SupportsIntermediateNoMoreSplits to signal per-source end
All readers must finish the current source before switching occurs
Pending splits from the current source are preserved for recovery
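
The "wait for all readers" rule can be sketched as a small piece of bookkeeping on the enumerator side. This is a simplified model, not Flink's implementation: SwitchCoordinatorSketch and onReaderFinished are hypothetical names, and reader registration, failures, and split reassignment are deliberately left out.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of the enumerator-side bookkeeping; not Flink's implementation.
final class SwitchCoordinatorSketch {
    private final int parallelism;
    private final int lastSourceIndex;
    private final Set<Integer> finishedReaders = new HashSet<>();
    private int currentSourceIndex = 0;

    SwitchCoordinatorSketch(int parallelism, int sourceCount) {
        this.parallelism = parallelism;
        this.lastSourceIndex = sourceCount - 1;
    }

    // Called when a reader reports that it has finished the given source.
    // Returns true only if this report completed the current source and
    // therefore triggered a switch to the next one.
    boolean onReaderFinished(int subtaskId, int finishedSourceIndex) {
        if (finishedSourceIndex != currentSourceIndex) {
            return false; // stale report for a source that was already left
        }
        finishedReaders.add(subtaskId); // duplicate reports are absorbed by the set
        if (finishedReaders.size() < parallelism) {
            return false; // some readers are still working on the current source
        }
        if (currentSourceIndex == lastSourceIndex) {
            return false; // nothing to switch to after the last source
        }
        currentSourceIndex++;
        finishedReaders.clear();
        return true;
    }

    int currentSourceIndex() {
        return currentSourceIndex;
    }

    public static void main(String[] args) {
        SwitchCoordinatorSketch coordinator = new SwitchCoordinatorSketch(2, 2);
        System.out.println(coordinator.onReaderFinished(0, 0)); // one reader still busy
        System.out.println(coordinator.onReaderFinished(1, 0)); // all done, switch happens
    }
}
```

Tagging each report with the source index it refers to lets the coordinator ignore late reports for a source it has already left.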

Step 4: Switch to Next Source

When all readers have finished the current source, the HybridSourceSplitEnumerator closes the current source's enumerator and creates the next one. If a SourceFactory was provided, it receives the SourceSwitchContext containing the previous enumerator for start position derivation. A SwitchSourceEvent is sent to all readers to transition them to the new source.

What happens:

The current source's enumerator is closed
The next source is created (directly or via SourceFactory with switch context)
The new source's enumerator is initialized and started
SwitchSourceEvent (containing the new source index) is broadcast to all readers
Each HybridSourceReader creates a new underlying reader for the next source
Split assignment begins for the new source
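
The position hand-off at switch time can be sketched as follows. Everything here is a hypothetical stand-in: FileEnumerator, StreamSource, and SwitchContext are illustrative types, and the assumptions that the finished enumerator exposes an inclusive end timestamp and that the next source should start one millisecond later are choices made for the example. In Flink, the factory is a SourceFactory and the context's accessor is getPreviousEnumerator(); how a position is read from the enumerator depends on the specific source.

```java
import java.util.function.Function;

// Simplified model of lazy source construction at switch time; not Flink's classes.
final class SwitchContextSketch {

    // Hypothetical enumerator of a bounded source that knows where its data ends.
    static final class FileEnumerator {
        private final long endTimestampMillis;

        FileEnumerator(long endTimestampMillis) {
            this.endTimestampMillis = endTimestampMillis;
        }

        long getEndTimestampMillis() {
            return endTimestampMillis;
        }
    }

    // Stand-in for the switch context: it only exposes the previous enumerator.
    static final class SwitchContext<E> {
        private final E previousEnumerator;

        SwitchContext(E previousEnumerator) {
            this.previousEnumerator = previousEnumerator;
        }

        E getPreviousEnumerator() {
            return previousEnumerator;
        }
    }

    // Hypothetical streaming source, described only by its start position.
    static final class StreamSource {
        final long startTimestampMillis;

        StreamSource(long startTimestampMillis) {
            this.startTimestampMillis = startTimestampMillis;
        }
    }

    // Factory evaluated at switch time. Assumption: the end timestamp is
    // inclusive, so the next source starts one millisecond after it.
    static final Function<SwitchContext<FileEnumerator>, StreamSource> NEXT_SOURCE_FACTORY =
            ctx -> new StreamSource(ctx.getPreviousEnumerator().getEndTimestampMillis() + 1);

    // Performs the hand-off: wraps the finished enumerator and builds the next source.
    static StreamSource switchTo(FileEnumerator finishedEnumerator) {
        return NEXT_SOURCE_FACTORY.apply(new SwitchContext<>(finishedEnumerator));
    }

    public static void main(String[] args) {
        StreamSource next = switchTo(new FileEnumerator(1_000L));
        System.out.println("next source starts at " + next.startTimestampMillis);
    }
}
```

Deferring construction to switch time is what allows the start position to depend on where the previous source actually ended, rather than on a value fixed when the job was defined.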

Step 5: Handle Checkpoints Across Source Boundaries

The HybridSource maintains consistent checkpoint state across source switches. The HybridSourceEnumeratorState captures the current source index, the underlying enumerator's state, and any pending splits from previous sources. On recovery, the source chain is replayed up to the checkpointed source index, and processing resumes from the correct position.

Key considerations:

HybridSourceEnumeratorState wraps the underlying enumerator state with source index
HybridSourceSplit wraps underlying splits with source index for correct deserialization
Recovery replays source creation up to the checkpointed index
Pending splits from previous sources are preserved for subtask reset scenarios
SwitchedSources maintains a registry of the sources instantiated so far, keyed by source index, so the matching source can be looked up during recovery
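
Why the source index travels with each split can be seen in a small serialization round trip. This is a minimal sketch, not Flink's serializer or wire format: WrappedSplitSketch is a hypothetical class, and the byte layout (index, split id, length-prefixed payload) is an assumption chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Simplified model of a split wrapper; not Flink's HybridSourceSplit or its format.
final class WrappedSplitSketch {
    final int sourceIndex;          // which source in the chain produced the split
    final String splitId;
    final byte[] wrappedSplitBytes; // the underlying split, already serialized

    WrappedSplitSketch(int sourceIndex, String splitId, byte[] wrappedSplitBytes) {
        this.sourceIndex = sourceIndex;
        this.splitId = splitId;
        this.wrappedSplitBytes = wrappedSplitBytes;
    }

    // Writes the source index first so a restore can pick the right
    // underlying deserializer before touching the payload.
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(sourceIndex);
            out.writeUTF(splitId);
            out.writeInt(wrappedSplitBytes.length);
            out.write(wrappedSplitBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static WrappedSplitSketch deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int sourceIndex = in.readInt();
            String splitId = in.readUTF();
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return new WrappedSplitSketch(sourceIndex, splitId, payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "offset=42".getBytes(StandardCharsets.UTF_8);
        WrappedSplitSketch restored =
                deserialize(new WrappedSplitSketch(1, "topic-0", payload).serialize());
        System.out.println(restored.sourceIndex + " " + restored.splitId);
    }
}
```

Keeping the underlying split as opaque bytes means the wrapper never needs to understand any particular source's split type; only the source identified by the index does.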

Execution Diagram

GitHub URL

Workflow Repository