Implementation:Apache Flink DataGeneratorSource
| Knowledge Sources | |
|---|---|
| Domains | Connectors, Source_Framework |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A data source that produces N data points in parallel by mapping a sequence of Long indices through a user-supplied GeneratorFunction.
Description
DataGeneratorSource is an experimental Flink Source implementation that generates bounded streams of data for testing and prototyping. It works by internally delegating to a NumberSequenceSource to produce a sequence of Long values from 0 to count-1, which are then transformed into output records by a user-supplied GeneratorFunction.
The source splits the index sequence into as many parallel sub-sequences as there are parallel source readers, enabling parallel data generation. Each sub-sequence is produced in order within its partition, ensuring deterministic output when parallelism is constrained to one.
Key architectural features:
- Delegation pattern: The source delegates split enumeration, checkpoint serialization, and enumerator restoration to an internal NumberSequenceSource, reusing the existing infrastructure for number sequence splitting.
- Rate limiting: Built-in support for rate limiting via RateLimiterStrategy, allowing control over the overall event production rate across all subtasks.
- Closure cleaning: The generator function, rate limiter strategy, and source reader factory are all cleaned via ClosureCleaner to ensure serializability.
- OutputTypeConfigurable: Supports late-binding of output type information, forwarding setOutputType calls to the generator function if it implements OutputTypeConfigurable.
- Boundedness: Always returns Boundedness.BOUNDED, though for very large counts (e.g., Long.MAX_VALUE) it can effectively serve as an unbounded stream.
Usage
Use DataGeneratorSource to generate synthetic data streams for testing, benchmarking, or prototyping Flink applications. It is the primary entry point for programmatic data generation in Flink's connector ecosystem and replaces older approaches like StreamExecutionEnvironment.generateSequence().
Code Reference
Source Location
- Repository: Apache_Flink
- File:
flink-connectors/flink-connector-datagen/src/main/java/org/apache/flink/connector/datagen/source/DataGeneratorSource.java - Lines: 1-219
Signature
@Experimental
public class DataGeneratorSource<OUT>
implements Source<OUT, NumberSequenceSplit, Collection<NumberSequenceSplit>>,
ResultTypeQueryable<OUT>,
OutputTypeConfigurable<OUT>
Import
import org.apache.flink.connector.datagen.source.DataGeneratorSource;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| generatorFunction | GeneratorFunction<Long, OUT> | Yes | The mapping function that transforms Long indices into output records |
| count | long | Yes | The total number of data points to generate (0 to count-1) |
| rateLimiterStrategy | RateLimiterStrategy | No | Strategy for rate limiting event production (defaults to noOp) |
| typeInfo | TypeInformation<OUT> | Yes | Type information for the produced data points |
Outputs
| Name | Type | Description |
|---|---|---|
| records | OUT | The generated output records, produced by applying the generator function to Long indices |
| boundedness | Boundedness.BOUNDED | The source is always bounded |
Usage Examples
// Basic usage: generate 1000 string elements
GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;
DataGeneratorSource<String> source =
new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);
DataStreamSource<String> stream =
env.fromSource(source, WatermarkStrategy.noWatermarks(), "Generator Source");
// With rate limiting: generate an effectively unbounded stream at 100 events/sec
GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;
DataGeneratorSource<String> rateLimitedSource =
new DataGeneratorSource<>(
generatorFunction,
Long.MAX_VALUE,
RateLimiterStrategy.perSecond(100),
Types.STRING);