Principle: Apache Flink Data Generation
| Knowledge Sources | |
|---|---|
| Domains | Data_Generation, Testing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Description
The Data Generation Framework provides a built-in source connector for producing synthetic or test data streams in Apache Flink. Located in the flink-connector-datagen module under the org.apache.flink.connector.datagen.source package, the framework is built on top of the FLIP-27 Source API and centers on the GeneratorFunction interface and the DataGeneratorSource class.
The key abstractions are:
- `GeneratorFunction` -- An `@Experimental` functional interface extending `Function` that maps an input value of type `T` to an output value of type `O` via the `map(T value)` method. It includes lifecycle hooks: `open(SourceReaderContext)` for initialization and `close()` for teardown. In the standard usage, the input type is `Long` (a sequential index) and the output type is the desired record type.
- `DataGeneratorSource` -- An `@Experimental` class implementing `Source<OUT, NumberSequenceSplit, Collection<NumberSequenceSplit>>` that drives the generation process. It delegates split enumeration to an internal `NumberSequenceSource` that partitions the range `[0, count-1]` across parallel readers. Each reader applies the user-supplied `GeneratorFunction` to transform sequential `Long` indices into output records. The source supports an optional `RateLimiterStrategy` for controlling throughput.
- `FromElementsGeneratorFunction` -- A convenience implementation that maps indices to elements from a predefined collection, cycling through them in order.
- `IndexLookupGeneratorFunction` -- A specialized implementation that retrieves output elements based on the index value, typically used for deterministic data generation in testing scenarios.
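Conceptually, the contract boils down to a mapping method plus lifecycle hooks. The following is a minimal, Flink-free sketch of that shape (the real interface lives in `org.apache.flink.connector.datagen.source`, extends Flink's `Function`, declares `throws Exception` on `map`, and passes a `SourceReaderContext` to `open`; the names below are simplified for illustration):

```java
// Flink-free sketch of the GeneratorFunction contract.
// Simplifications: no checked exceptions, no SourceReaderContext.
@FunctionalInterface
interface SimpleGeneratorFunction<T, O> {
    // Core mapping: transform an input value (a sequential Long index
    // in standard usage) into an output record.
    O map(T value);

    // Lifecycle hooks with no-op defaults, mirroring open()/close().
    default void open() {}
    default void close() {}
}

class GeneratorDemo {
    // Standard usage shape: Long index in, record out.
    static final SimpleGeneratorFunction<Long, String> TO_STRING =
            index -> "Number: " + index;
}
```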
Theoretical Basis
The Data Generation Framework applies the Strategy pattern through the GeneratorFunction interface. The DataGeneratorSource itself is invariant in its split management and enumeration logic (delegated to NumberSequenceSource), while the data transformation strategy is externalized into the pluggable GeneratorFunction. This separation allows the same source infrastructure to produce arbitrary record types simply by swapping the generator function.
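This separation can be illustrated with a small, Flink-free driver: the enumeration loop stays fixed while the record-producing strategy is injected, mirroring how `DataGeneratorSource` delegates to `GeneratorFunction` (the `TinyDriver` name and its API are illustrative, not Flink's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Invariant "driver": enumerates indices 0..count-1 and delegates
// record creation to a pluggable strategy.
class TinyDriver {
    static <O> List<O> generate(long count, Function<Long, O> strategy) {
        List<O> out = new ArrayList<>();
        for (long i = 0; i < count; i++) {
            out.add(strategy.apply(i));
        }
        return out;
    }
}
```

The same driver produces records of entirely different types simply by swapping the function, which is the essence of the Strategy pattern at work here.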
The framework leverages index-based determinism: because each parallel reader receives a disjoint sub-range of Long indices, and the GeneratorFunction is a pure function of its index argument, the generated data is fully deterministic and reproducible regardless of parallelism or execution order. This property is valuable for testing, benchmarking, and producing repeatable synthetic datasets.
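The determinism property can be checked with a small simulation: splitting the index range differently must not change the generated sequence, because each record depends only on its index. This is an illustrative, Flink-free sketch, not Flink's split logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simulates parallel readers: each reader owns a disjoint sub-range of
// indices and applies the same pure function. Because the function
// depends only on the index, the overall result is independent of how
// the range is partitioned.
class DeterminismDemo {
    static <O> List<O> readSplit(long from, long toExclusive, Function<Long, O> fn) {
        List<O> out = new ArrayList<>();
        for (long i = from; i < toExclusive; i++) {
            out.add(fn.apply(i));
        }
        return out;
    }
}
```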
The source is always bounded (returns Boundedness.BOUNDED), but with a count of Long.MAX_VALUE it can be used as an effectively unbounded stream. Rate limiting is applied through a composable RateLimiterStrategy, which controls the throughput at the source level without requiring downstream backpressure. Closure cleaning via ClosureCleaner is applied to both the generator function and the rate limiter strategy to ensure clean serialization in distributed environments.
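The idea of throttling at the source, rather than relying on downstream backpressure, can be sketched with a simple fixed-window limiter. Flink's actual `RateLimiterStrategy` is asynchronous and more sophisticated; the class below is purely illustrative, with the window id injected so behavior is deterministic:

```java
// Minimal sketch of source-level rate limiting: grant at most
// permitsPerWindow records per window (e.g. per second). All names
// here are illustrative, not Flink API.
class WindowRateLimiter {
    private final int permitsPerWindow;
    private long currentWindow = -1;
    private int used = 0;

    WindowRateLimiter(int permitsPerWindow) {
        this.permitsPerWindow = permitsPerWindow;
    }

    // Returns true if a record may be emitted in the given window.
    // The window id is injected (rather than read from a clock) so
    // the behavior is deterministic and testable.
    boolean tryAcquire(long windowId) {
        if (windowId != currentWindow) {
            currentWindow = windowId;
            used = 0;
        }
        if (used < permitsPerWindow) {
            used++;
            return true;
        }
        return false;
    }
}
```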
```java
// Example: generating a bounded stream of 1000 string records
GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;

DataGeneratorSource<String> source =
        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);

DataStreamSource<String> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Generator Source");
```
| Component | Type | Purpose |
|---|---|---|
| `GeneratorFunction<T, O>` | Interface | Maps index values to output records |
| `DataGeneratorSource<OUT>` | Class | FLIP-27 source driving parallel data generation |
| `FromElementsGeneratorFunction` | Class | Generator cycling through a predefined collection |
| `IndexLookupGeneratorFunction` | Class | Generator performing index-based lookups |