Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Workflow:Apache Flink Hybrid Source Switching

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Stream_Processing, Source_Orchestration
Last Updated 2026-02-09 13:00 GMT

Overview

End-to-end process for composing multiple Flink sources into a sequential chain using HybridSource, enabling seamless transitions from bounded historical data to unbounded real-time streams.

Description

This workflow describes how to use Flink's HybridSource to chain multiple data sources that execute sequentially. The primary use case is backfilling: first reading historical data from a bounded source (e.g., files, database snapshot), then seamlessly switching to an unbounded source (e.g., Kafka, Kinesis) for real-time processing. The HybridSource coordinates the lifecycle of underlying sources, managing reader state transitions and split enumeration across source boundaries. Each source except the last must be bounded, and the switch context allows passing positional information between sources for gap-free transitions.

Key capabilities:

  • Sequential chaining of two or more sources
  • Seamless bounded-to-unbounded transitions
  • SourceSwitchContext for position-based start offset derivation
  • Automatic reader lifecycle management across source boundaries
  • Checkpoint and recovery support across source switches
  • Split wrapping for unified state management

Usage

Execute this workflow when you need to process historical backfill data followed by real-time streaming data in a single Flink job, without requiring separate batch and streaming jobs. Common scenarios include: reading from file storage then switching to a message queue, reading from a database snapshot then switching to a change data capture stream, or chaining multiple bounded sources for staged data ingestion.

Execution Steps

Step 1: Define Source Chain

Configure the HybridSource using the HybridSourceBuilder to specify the ordered sequence of underlying sources. Each source is registered with its factory and boundedness declaration. The first source is added directly, while subsequent sources can use a SourceFactory that receives the previous source's enumerator state for dynamic start position derivation.

Key considerations:

  • All sources except the last must be bounded
  • The last source can be either bounded or unbounded
  • SourceFactory enables lazy source construction with context from the previous source
  • SourceSwitchContext provides access to the previous enumerator for position derivation
  • Sources are indexed internally starting from 0

Step 2: Initialize First Source

When the HybridSource starts, it creates the first underlying source's SplitEnumerator and SourceReader. The HybridSourceSplitEnumerator wraps the first source's enumerator, intercepting split assignments and reader registrations. The HybridSourceReader wraps the underlying reader and tracks the current source index.

What happens:

  • The HybridSourceSplitEnumerator creates the first source's enumerator
  • All parallel HybridSourceReaders are initialized with the first source's reader
  • Splits from the first source are wrapped in HybridSourceSplit containers
  • The underlying reader begins processing its assigned splits

Step 3: Process Current Source Until Completion

The current source's reader processes its splits normally. As each reader finishes its assigned splits, it sends a SourceReaderFinishedEvent to the enumerator. The enumerator tracks which readers have completed the current source and waits for all readers to finish before initiating the switch.

Key considerations:

  • Readers process splits through the standard FLIP-27 source reader pipeline
  • SourceReaderFinishedEvent signals reader-level completion
  • The enumerator uses SupportsIntermediateNoMoreSplits to signal per-source end
  • All readers must finish the current source before switching occurs
  • Pending splits from the current source are preserved for recovery

Step 4: Switch to Next Source

When all readers have finished the current source, the HybridSourceSplitEnumerator closes the current source's enumerator and creates the next one. If a SourceFactory was provided, it receives the SourceSwitchContext containing the previous enumerator for start position derivation. A SwitchSourceEvent is sent to all readers to transition them to the new source.

What happens:

  • The current source's enumerator is closed
  • The next source is created (directly or via SourceFactory with switch context)
  • The new source's enumerator is initialized and started
  • SwitchSourceEvent (containing the new source index) is broadcast to all readers
  • Each HybridSourceReader creates a new underlying reader for the next source
  • Split assignment begins for the new source

Step 5: Handle Checkpoints Across Source Boundaries

The HybridSource maintains consistent checkpoint state across source switches. The HybridSourceEnumeratorState captures the current source index, the underlying enumerator's state, and any pending splits from previous sources. On recovery, the source chain is replayed up to the checkpointed source index, and processing resumes from the correct position.

Key considerations:

  • HybridSourceEnumeratorState wraps the underlying enumerator state with source index
  • HybridSourceSplit wraps underlying splits with source index for correct deserialization
  • Recovery replays source creation up to the checkpointed index
  • Pending splits from previous sources are preserved for subtask reset scenarios
  • SwitchedSources maintains a registry of source factories for recovery

Execution Diagram

GitHub URL

Workflow Repository