Principle: Apache Flink Data Generation
| Knowledge Sources | |
|---|---|
| Domains | Data_Generation, Testing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Description
The Data Generation Framework provides a built-in source connector for producing synthetic or test data streams in Apache Flink. Located in the flink-connector-datagen module under the org.apache.flink.connector.datagen.source package, the framework is built on top of the FLIP-27 Source API and centers on the GeneratorFunction interface and the DataGeneratorSource class.
The key abstractions are:
- `GeneratorFunction` -- An `@Experimental` functional interface extending `Function` that maps an input value of type `T` to an output value of type `O` via the `map(T value)` method. It includes lifecycle hooks: `open(SourceReaderContext)` for initialization and `close()` for teardown. In the standard usage, the input type is `Long` (a sequential index) and the output type is the desired record type.
- `DataGeneratorSource` -- An `@Experimental` class implementing `Source<OUT, NumberSequenceSplit, Collection<NumberSequenceSplit>>` that drives the generation process. It delegates split enumeration to an internal `NumberSequenceSource` that partitions the range `[0, count-1]` across parallel readers. Each reader applies the user-supplied `GeneratorFunction` to transform sequential `Long` indices into output records. The source supports an optional `RateLimiterStrategy` for controlling throughput.
- `FromElementsGeneratorFunction` -- A convenience implementation that maps indices to elements from a predefined collection, cycling through them in order.
- `IndexLookupGeneratorFunction` -- A specialized implementation that retrieves output elements based on the index value, typically used for deterministic data generation in testing scenarios.
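Conceptually, the contract boils down to a mapping method plus lifecycle hooks. The following is a minimal, Flink-free sketch of that shape (the real interface lives in `org.apache.flink.connector.datagen.source`, extends Flink's `Function`, declares `throws Exception` on `map`, and passes a `SourceReaderContext` to `open`; the names below are simplified for illustration):

```java
// Flink-free sketch of the GeneratorFunction contract.
// Simplifications: no checked exceptions, no SourceReaderContext.
@FunctionalInterface
interface SimpleGeneratorFunction<T, O> {
    // Core mapping: transform an input value (a sequential Long index
    // in standard usage) into an output record.
    O map(T value);

    // Lifecycle hooks with no-op defaults, mirroring open()/close().
    default void open() {}
    default void close() {}
}

class GeneratorDemo {
    // Standard usage shape: Long index in, record out.
    static final SimpleGeneratorFunction<Long, String> TO_STRING =
            index -> "Number: " + index;
}
```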
Theoretical Basis
The Data Generation Framework applies the Strategy pattern through the GeneratorFunction interface. The DataGeneratorSource itself is invariant in its split management and enumeration logic (delegated to NumberSequenceSource), while the data transformation strategy is externalized into the pluggable GeneratorFunction. This separation allows the same source infrastructure to produce arbitrary record types simply by swapping the generator function.
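This separation can be illustrated with a small, Flink-free driver: the enumeration loop stays fixed while the record-producing strategy is injected, mirroring how `DataGeneratorSource` delegates to `GeneratorFunction` (the `TinyDriver` name and its API are illustrative, not Flink's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Invariant "driver": enumerates indices 0..count-1 and delegates
// record creation to a pluggable strategy.
class TinyDriver {
    static <O> List<O> generate(long count, Function<Long, O> strategy) {
        List<O> out = new ArrayList<>();
        for (long i = 0; i < count; i++) {
            out.add(strategy.apply(i));
        }
        return out;
    }
}
```

The same driver produces records of entirely different types simply by swapping the function, which is the essence of the Strategy pattern at work here.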
The framework leverages index-based determinism: because each parallel reader receives a disjoint sub-range of Long indices, and the GeneratorFunction is a pure function of its index argument, the generated data is fully deterministic and reproducible regardless of parallelism or execution order. This property is valuable for testing, benchmarking, and producing repeatable synthetic datasets.
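The determinism property can be checked with a small simulation: splitting the index range differently must not change the generated sequence, because each record depends only on its index. This is an illustrative, Flink-free sketch, not Flink's split logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simulates parallel readers: each reader owns a disjoint sub-range of
// indices and applies the same pure function. Because the function
// depends only on the index, the overall result is independent of how
// the range is partitioned.
class DeterminismDemo {
    static <O> List<O> readSplit(long from, long toExclusive, Function<Long, O> fn) {
        List<O> out = new ArrayList<>();
        for (long i = from; i < toExclusive; i++) {
            out.add(fn.apply(i));
        }
        return out;
    }
}
```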
The source is always bounded (returns Boundedness.BOUNDED), but with a count of Long.MAX_VALUE it can be used as an effectively unbounded stream. Rate limiting is applied through a composable RateLimiterStrategy, which controls the throughput at the source level without requiring downstream backpressure. Closure cleaning via ClosureCleaner is applied to both the generator function and the rate limiter strategy to ensure clean serialization in distributed environments.
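The idea of throttling at the source, rather than relying on downstream backpressure, can be sketched with a simple fixed-window limiter. Flink's actual `RateLimiterStrategy` is asynchronous and more sophisticated; the class below is purely illustrative, with the window id injected so behavior is deterministic:

```java
// Minimal sketch of source-level rate limiting: grant at most
// permitsPerWindow records per window (e.g. per second). All names
// here are illustrative, not Flink API.
class WindowRateLimiter {
    private final int permitsPerWindow;
    private long currentWindow = -1;
    private int used = 0;

    WindowRateLimiter(int permitsPerWindow) {
        this.permitsPerWindow = permitsPerWindow;
    }

    // Returns true if a record may be emitted in the given window.
    // The window id is injected (rather than read from a clock) so
    // the behavior is deterministic and testable.
    boolean tryAcquire(long windowId) {
        if (windowId != currentWindow) {
            currentWindow = windowId;
            used = 0;
        }
        if (used < permitsPerWindow) {
            used++;
            return true;
        }
        return false;
    }
}
```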
```java
// Example: generating a bounded stream of 1000 string records
GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;

DataGeneratorSource<String> source =
        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);

DataStreamSource<String> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Generator Source");
```
| Component | Type | Purpose |
|---|---|---|
| `GeneratorFunction<T, O>` | Interface | Maps index values to output records |
| `DataGeneratorSource<OUT>` | Class | FLIP-27 source driving parallel data generation |
| `FromElementsGeneratorFunction` | Class | Generator cycling through a predefined collection |
| `IndexLookupGeneratorFunction` | Class | Generator performing index-based lookups |