Implementation:Apache Flink DataGeneratorSource

Knowledge Sources	Apache_Flink
Domains	Connectors, Source_Framework
Last Updated	2026-02-09 00:00 GMT

Overview

A data source that produces N data points in parallel by mapping a sequence of Long indices through a user-supplied GeneratorFunction.

Description

DataGeneratorSource is an experimental Flink Source implementation that generates bounded streams of data for testing and prototyping. It works by internally delegating to a NumberSequenceSource to produce a sequence of Long values from 0 to count-1, which are then transformed into output records by a user-supplied GeneratorFunction.
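The index-to-record mapping described above can be illustrated without any Flink dependency. The following is a minimal single-threaded sketch, not the Flink implementation: the class `IndexMappingSketch` and its `generate` method are hypothetical names, and `java.util.function.LongFunction` stands in for GeneratorFunction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;

// Hypothetical illustration only; not part of Flink.
class IndexMappingSketch {

    // Applies the generator to every index in [0, count - 1], in order.
    static <T> List<T> generate(long count, LongFunction<T> generator) {
        List<T> records = new ArrayList<>();
        for (long index = 0; index < count; index++) {
            records.add(generator.apply(index));
        }
        return records;
    }
}
```

For example, `IndexMappingSketch.generate(3, i -> "Number: " + i)` yields the three strings "Number: 0", "Number: 1", and "Number: 2".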

The source splits the index sequence into as many sub-sequences as there are parallel source readers, which enables parallel data generation. Each sub-sequence is emitted in order by its reader, but records from different readers interleave arbitrarily, so the overall output order is deterministic only when the source parallelism is 1.
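One way to picture the splitting is as a partition of the index range into contiguous, near-equal sub-ranges. The sketch below assumes that layout, with any remainder going to the first splits. The exact strategy is owned by NumberSequenceSource and may differ in detail, and `IndexSplitSketch` and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real splitting lives in NumberSequenceSource.
class IndexSplitSketch {

    // Partitions [0, count - 1] into at most numSplits contiguous ranges.
    // Each range is returned as {from, to}, both inclusive.
    static List<long[]> split(long count, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long base = count / numSplits;       // minimum size of every split
        long remainder = count % numSplits;  // first 'remainder' splits get one extra index
        long from = 0;
        for (int i = 0; i < numSplits; i++) {
            long size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                continue; // fewer indices than splits: skip empty ranges
            }
            ranges.add(new long[] {from, from + size - 1});
            from += size;
        }
        return ranges;
    }
}
```

With `count = 10` and `numSplits = 3`, this sketch produces the ranges [0, 3], [4, 6], and [7, 9]. Each reader would emit its own range in ascending order.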

Key architectural features:

Delegation pattern: The source delegates split enumeration, checkpoint serialization, and enumerator restoration to an internal NumberSequenceSource, reusing the existing infrastructure for number sequence splitting.
Rate limiting: Built-in support for rate limiting via RateLimiterStrategy, allowing control over the overall event production rate across all subtasks.
Closure cleaning: The generator function, rate limiter strategy, and source reader factory are all cleaned via ClosureCleaner to ensure serializability.
OutputTypeConfigurable: Supports late-binding of output type information, forwarding setOutputType calls to the generator function if it implements OutputTypeConfigurable.
Boundedness: Always returns Boundedness.BOUNDED, though for very large counts (e.g., Long.MAX_VALUE) it can effectively serve as an unbounded stream.
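Two of the points above come down to simple arithmetic, sketched here in plain Java. The first method assumes that an overall rate is divided evenly among the parallel subtasks. This is an assumption about how a per-second strategy is applied, not a statement of the Flink implementation. The second method estimates how long a count of Long.MAX_VALUE would last at a given rate. `RateLimitSketch` and both method names are hypothetical.

```java
// Hypothetical illustration only; not part of Flink.
class RateLimitSketch {

    // Assumption: the overall rate is shared evenly across all subtasks.
    static double perSubtaskRate(double overallRecordsPerSecond, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        return overallRecordsPerSecond / parallelism;
    }

    // Rough time, in years, needed to emit 'count' records at the overall rate.
    static double yearsToExhaust(long count, double overallRecordsPerSecond) {
        double seconds = count / overallRecordsPerSecond;
        double secondsPerYear = 365.25 * 24 * 3600;
        return seconds / secondsPerYear;
    }
}
```

Under these assumptions, an overall limit of 100 records per second at parallelism 4 gives each subtask 25 records per second, and emitting Long.MAX_VALUE records at that overall rate would take on the order of billions of years. That is why such a source behaves as unbounded in practice while still reporting Boundedness.BOUNDED.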

Usage

Use DataGeneratorSource to generate synthetic data streams for testing, benchmarking, or prototyping Flink applications. It is the primary entry point for programmatic data generation in Flink's connector ecosystem and replaces older approaches like StreamExecutionEnvironment.generateSequence().

Code Reference

Source Location

Repository: Apache_Flink
File: flink-connectors/flink-connector-datagen/src/main/java/org/apache/flink/connector/datagen/source/DataGeneratorSource.java
Lines: 1-219

Signature

@Experimental
public class DataGeneratorSource<OUT>
        implements Source<OUT, NumberSequenceSplit, Collection<NumberSequenceSplit>>,
                ResultTypeQueryable<OUT>,
                OutputTypeConfigurable<OUT>

Import

import org.apache.flink.connector.datagen.source.DataGeneratorSource;

I/O Contract

Inputs

Name	Type	Required	Description
generatorFunction	GeneratorFunction<Long, OUT>	Yes	The mapping function that transforms Long indices into output records
count	long	Yes	The total number of data points to generate; the generator function receives the indices 0 through count-1
rateLimiterStrategy	RateLimiterStrategy	No	Strategy for rate limiting event production (defaults to noOp)
typeInfo	TypeInformation<OUT>	Yes	Type information for the produced data points

Outputs

Name	Type	Description
records	OUT	The generated output records, produced by applying the generator function to Long indices
boundedness	Boundedness	Always Boundedness.BOUNDED, regardless of count

Usage Examples

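// Imports needed by both examples
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.connector.source.util.ratelimit.RateLimiterStrategy;
import org.apache.flink.connector.datagen.source.DataGeneratorSource;
import org.apache.flink.connector.datagen.source.GeneratorFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Basic usage: generate 1000 string elements
GeneratorFunction<Long, String> generatorFunction = index -> "Number: " + index;

DataGeneratorSource<String> source =
        new DataGeneratorSource<>(generatorFunction, 1000, Types.STRING);

// Attach the source to the job; no watermarks are needed for synthetic data
DataStreamSource<String> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Generator Source");

// With rate limiting: generate an effectively unbounded stream at 100 events/sec
// (reuses generatorFunction from the basic example above)
DataGeneratorSource<String> rateLimitedSource =
        new DataGeneratorSource<>(
                generatorFunction,
                Long.MAX_VALUE,
                RateLimiterStrategy.perSecond(100),
                Types.STRING);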
