Implementation:Heibaiying BigData Notes DataStream Transformations
| Knowledge Sources | |
|---|---|
| Domains | Stream_Processing, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
The transformation methods that the Apache Flink DataStream API provides for processing a DataStream.
Description
The Flink DataStream API provides a set of transformation methods on DataStream objects that enable developers to express processing logic declaratively. Each transformation accepts a user-defined function (UDF) and returns a new DataStream with the transformation applied. The primary transformations are:
- map -- Takes a MapFunction<T, R> and produces a DataStream<R> with exactly one output per input.
- flatMap -- Takes a FlatMapFunction<T, R> and produces a DataStream<R> with zero or more outputs per input.
- filter -- Takes a FilterFunction<T> and produces a DataStream<T> containing only elements that satisfy the predicate.
- keyBy -- Takes a key selector (field index, field name, or KeySelector function) and produces a KeyedStream partitioned by key. The field-index and field-name variants are deprecated since Flink 1.11 in favor of KeySelector.
- window -- Applied on a KeyedStream, takes a WindowAssigner and produces a WindowedStream for windowed aggregations.
In the BigData-Notes repository, the KafkaStreamingJob demonstrates a map transformation on a Kafka-sourced stream, and the Flink_Data_Transformation notes provide comprehensive coverage of all available transformations with examples.
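The per-element contracts listed above (one output for map, zero or more for flatMap, a subset for filter) can be modeled in plain Java. The sketch below does not use Flink; the class TransformationContracts and its helper methods are hypothetical names chosen for illustration, and in-memory lists stand in for streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Plain-Java model of the per-element contracts; this is NOT Flink code.
final class TransformationContracts {

    // map: exactly one output element per input element.
    static <T, R> List<R> map(List<T> in, Function<T, R> mapper) {
        List<R> out = new ArrayList<>();
        for (T element : in) {
            out.add(mapper.apply(element));
        }
        return out;
    }

    // flatMap: zero or more output elements per input element,
    // emitted through a collector-style callback.
    static <T, R> List<R> flatMap(List<T> in, BiConsumer<T, Consumer<R>> flatMapper) {
        List<R> out = new ArrayList<>();
        for (T element : in) {
            flatMapper.accept(element, out::add);
        }
        return out;
    }

    // filter: keeps only the elements for which the predicate returns true.
    static <T> List<T> filter(List<T> in, Predicate<T> predicate) {
        List<T> out = new ArrayList<>();
        for (T element : in) {
            if (predicate.test(element)) {
                out.add(element);
            }
        }
        return out;
    }

    // Example flatMap logic: split lines into words, emitting nothing for blank lines.
    static List<String> splitWords(List<String> lines) {
        return flatMap(lines, (String line, Consumer<String> out) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.accept(word);
                }
            }
        });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello flink", "", "hello");
        System.out.println(map(lines, s -> s.toUpperCase()).size());       // same size as the input
        System.out.println(splitWords(lines));                             // the blank line emits nothing
        System.out.println(filter(lines, s -> !s.isEmpty()));              // the blank line is dropped
    }
}
```

In Flink the same functions run per record on an unbounded stream rather than over a finite list, so this model only captures the input-to-output cardinality of each transformation.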
Usage
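Chain these transformation methods on a DataStream to build the processing pipeline. Transformations are evaluated lazily: each call only adds an operator to the job graph, and nothing runs until env.execute() is called. Define the transformation logic with lambda expressions or anonymous inner classes. When a lambda returns a generic type such as Tuple2, Java type erasure can prevent Flink from inferring the output type, so an anonymous inner class or an explicit returns(...) type hint may be needed.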
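Lazy evaluation can be illustrated with an analogy from the JDK. The sketch below uses java.util.stream rather than Flink: building the pipeline invokes no user function, and the terminal collect call plays the role that env.execute() plays in a Flink job. The class name LazyPipelineAnalogy is hypothetical, and the analogy covers only the deferred execution, not Flink's distributed runtime.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// JDK-only analogy for lazy pipeline construction; this is NOT Flink code.
final class LazyPipelineAnalogy {

    // Returns {map calls before execution, map calls after execution, result size}.
    static int[] countCalls(List<String> lines) {
        AtomicInteger calls = new AtomicInteger();

        // Declaring the pipeline only records the operations.
        Stream<String> pipeline = lines.stream()
                .filter(line -> !line.isEmpty())
                .map(line -> {
                    calls.incrementAndGet();
                    return line.toUpperCase();
                });
        int before = calls.get(); // the map function has not run yet

        // The terminal operation triggers execution, similar in spirit to env.execute().
        List<String> result = pipeline.collect(Collectors.toList());
        int after = calls.get();

        return new int[] {before, after, result.size()};
    }

    public static void main(String[] args) {
        int[] r = countCalls(Arrays.asList("a", "", "b"));
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints "0 2 2"
    }
}
```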
Code Reference
Source Location
- Repository: BigData-Notes
- File: code/Flink/flink-kafka-integration/src/main/java/com/heibaiying/KafkaStreamingJob.java (line 42, map example)
- File: notes/Flink_Data_Transformation.md (lines 1-314, comprehensive transformation reference)
Signature
// map transformation
public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper)
// flatMap transformation
public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper)
// filter transformation
public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter)
// keyBy transformation (by field index)
public KeyedStream<T, Tuple> keyBy(int... fields)
// keyBy transformation (by KeySelector)
public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key)
// window transformation (on KeyedStream)
public <W extends Window> WindowedStream<T, K, W> window(WindowAssigner<? super T, W> assigner)
Import
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mapper | MapFunction<T, R> | Yes (for map) | A function that takes one element of type T and returns one element of type R. |
| flatMapper | FlatMapFunction<T, R> | Yes (for flatMap) | A function that takes one element of type T and emits zero or more elements of type R via a Collector. |
| filter | FilterFunction<T> | Yes (for filter) | A boolean predicate function; elements returning true are retained. |
| fields / key | int... or KeySelector<T, K> | Yes (for keyBy) | One or more tuple field indices, or a key extraction function, that determine the grouping key. |
| assigner | WindowAssigner<? super T, W> | Yes (for window) | Defines the windowing strategy (tumbling, sliding, session, or global windows). |
Outputs
| Name | Type | Description |
|---|---|---|
| result (map) | SingleOutputStreamOperator<R> (a DataStream<R>) | A new stream where each element is the result of applying the map function to the corresponding input element. |
| result (flatMap) | SingleOutputStreamOperator<R> (a DataStream<R>) | A new stream containing all elements emitted by the flatMap function across all input elements. |
| result (filter) | SingleOutputStreamOperator<T> (a DataStream<T>) | A new stream containing only the input elements that satisfied the filter predicate. |
| result (keyBy) | KeyedStream<T, K> | A logically partitioned stream where all elements with the same key are directed to the same operator subtask. |
| result (window) | WindowedStream<T, K, W> | A windowed stream ready for aggregation operations (reduce, aggregate, apply). |
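The keyBy guarantee in the table above, that equal keys always reach the same operator subtask, follows from routing by a deterministic function of the key's hash. The sketch below is a simplified plain-Java model, not Flink's implementation: Flink additionally applies a murmur hash to key.hashCode() before assigning a key group, and the class KeyRoutingSketch, its methods, and the parallelism values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of hash-based key routing; this is NOT Flink code.
final class KeyRoutingSketch {

    // Maps a key to a key group, then maps the key group to a subtask index.
    // Simplification: uses hashCode() directly instead of a murmur hash.
    static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        int keyGroup = Math.floorMod(key.hashCode(), maxParallelism);
        return keyGroup * parallelism / maxParallelism;
    }

    // Groups the incoming keys by the subtask that would receive them.
    static Map<Integer, List<String>> route(List<String> keys, int maxParallelism, int parallelism) {
        Map<Integer, List<String>> bySubtask = new TreeMap<>();
        for (String key : keys) {
            int subtask = subtaskFor(key, maxParallelism, parallelism);
            bySubtask.computeIfAbsent(subtask, k -> new ArrayList<>()).add(key);
        }
        return bySubtask;
    }

    public static void main(String[] args) {
        // Every occurrence of "a" lands in the same list, whichever subtask that is.
        System.out.println(route(Arrays.asList("a", "b", "a", "c", "a"), 128, 4));
    }
}
```

Because the subtask index depends only on the key and the two parallelism settings, per-key state and windows can live on a single subtask without coordination.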
Usage Examples
Basic Usage
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
// Assumes an existing DataStream<String> named stream, such as a Kafka-sourced stream.
// Map: convert each line to uppercase
DataStream<String> upperCased = stream.map(line -> line.toUpperCase());
// FlatMap: split each line into words and emit (word, 1) tuples
DataStream<Tuple2<String, Integer>> wordCounts = stream
    .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
                // split() can yield an empty token for blank or leading-whitespace input
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    });
// Filter: retain only non-empty strings
DataStream<String> nonEmpty = stream.filter(line -> !line.isEmpty());
// KeyBy + Window + Sum: word count over 5-second tumbling windows.
// A KeySelector lambda replaces keyBy(0), which is deprecated since Flink 1.11.
DataStream<Tuple2<String, Integer>> windowedCounts = wordCounts
    .keyBy(pair -> pair.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .sum(1);
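To make the result of the keyed, windowed sum concrete, the following plain-Java sketch simulates it on a handful of timestamped lines. It does not use Flink, and it makes several assumptions: explicit millisecond timestamps stand in for processing time, windows are aligned to multiples of 5000 ms with no offset, and the class TumblingWordCountSketch and the "windowStart|word" key format are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a keyed 5-second tumbling-window word count; NOT Flink code.
final class TumblingWordCountSketch {

    static final long WINDOW_MS = 5_000L;

    // Start of the tumbling window that contains the timestamp
    // (assumes non-negative timestamps and a zero window offset).
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_MS);
    }

    // Counts words per (window, word) pair; the map key is "windowStart|word".
    static Map<String, Integer> count(long[] timestampsMs, String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < lines.length; i++) {
            long start = windowStart(timestampsMs[i]);
            for (String word : lines[i].trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // blank lines contribute nothing
                }
                counts.merge(start + "|" + word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] timestamps = {1_000L, 4_999L, 5_000L};
        String[] lines = {"a b", "a", "a"};
        // prints {0|a=2, 0|b=1, 5000|a=1}
        System.out.println(count(timestamps, lines));
    }
}
```

The line arriving at 5000 ms falls into the next window, so its count for "a" is reported separately from the two occurrences in the first window. A real Flink job emits each window's totals when the window fires rather than returning one map at the end.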