Environment:DataExpert io Data engineer handbook Flink Kafka Docker Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Stream_Processing
Last Updated: 2026-02-09 06:00 GMT

Overview

Docker Compose environment with Apache Flink 1.16.2 (JobManager + TaskManager), Confluent Cloud Kafka, and PostgreSQL for stream processing exercises.

Description

This environment provides a containerized Apache Flink cluster with a custom-built Docker image (`eczachly-pyflink`) based on `flink:1.16.2`. It includes Python 3.7.9 (required by PyFlink 1.16.2), Java 11, and pre-downloaded connector JARs for Kafka, JDBC, and PostgreSQL. The cluster connects to an external Confluent Cloud Kafka instance for real-time event ingestion and writes processed results to a local PostgreSQL database. The TaskManager is configured with 15 task slots and default parallelism of 3.

Usage

Use this environment for any Flink Streaming workflow that requires a local Flink cluster connected to Kafka and PostgreSQL. It is the mandatory prerequisite for running the Create_events_source_kafka, Create_processed_events_sink_postgres, GetLocation_UDF, Log_processing, Log_aggregation, and Tumble_Over_Window implementations.

System Requirements

  • OS: Linux, macOS, or Windows with WSL2 (Docker Desktop required)
  • Platform: linux/amd64 (explicitly set in docker-compose; ARM users need emulation)
  • Software: Docker Engine + Docker Compose (v2.x recommended)
  • Software: PostgreSQL 14, local instance on host at port 5432 (or via Docker)
  • Memory: minimum 4 GB allocated to Docker (Flink JobManager + TaskManager)
  • Disk: ~2 GB (custom Docker image with Python, Java, and Flink JARs)
  • Network: port 8081 available (Flink Web UI)

Dependencies

System Packages (inside Docker image)

  • Apache Flink = 1.16.2 (base image: `flink:1.16.2`)
  • Python = 3.7.9 (built from source; PyFlink 1.16.2 requires Python 3.6-3.8)
  • Java = OpenJDK 11 (`openjdk-11-jdk`)
  • Build tools: `build-essential`, `libssl-dev`, `zlib1g-dev`, `libbz2-dev`, `libffi-dev`, `liblzma-dev`

Python Packages

  • `apache-flink` == 1.16.2
  • `psycopg2-binary` == 2.9.1
  • `requests`

Java Connector JARs

  • `flink-python-1.16.2.jar`
  • `flink-sql-connector-kafka-1.16.2.jar`
  • `flink-connector-jdbc-1.16.2.jar`
  • `postgresql-42.2.26.jar`
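
Inside the provided image these JARs are already on the classpath. When submitting a PyFlink job from outside the image, the same connectors must be handed to the job via the `pipeline.jars` option. A minimal sketch of building that option string (the `/opt/flink/lib` path is an assumption for illustration; adjust to your install):

```python
from pathlib import Path

# Hypothetical location of the connector JARs; adjust to your install.
jar_dir = Path("/opt/flink/lib")
jars = [
    "flink-sql-connector-kafka-1.16.2.jar",
    "flink-connector-jdbc-1.16.2.jar",
    "postgresql-42.2.26.jar",
]
# PyFlink expects a semicolon-separated list of file:// URLs.
pipeline_jars = ";".join(f"file://{jar_dir / name}" for name in jars)
# Pass to a TableEnvironment, e.g.:
# t_env.get_config().get_configuration().set_string("pipeline.jars", pipeline_jars)
```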

Credentials

The following environment variables must be set in `flink-env.env` (copy from `example.env`):

  • `KAFKA_WEB_TRAFFIC_KEY`: Confluent Cloud API key for Kafka authentication
  • `KAFKA_WEB_TRAFFIC_SECRET`: Confluent Cloud API secret for Kafka authentication
  • `IP_CODING_KEY`: API key for ip2location.io geolocation service (requires free account at https://www.ip2location.io/)
  • `KAFKA_GROUP`: Kafka consumer group name (default: `web-events`)
  • `KAFKA_TOPIC`: Kafka topic to consume from (default: `bootcamp-events-prod`)
  • `KAFKA_URL`: Kafka bootstrap servers (default: `pkc-rgm37.us-west-2.aws.confluent.cloud:9092`)
  • `POSTGRES_URL`: JDBC connection string (default: `jdbc:postgresql://host.docker.internal:5432/postgres`)
  • `POSTGRES_USER`: PostgreSQL username (default: `postgres`)
  • `POSTGRES_PASSWORD`: PostgreSQL password (default: `postgres`)
  • `POSTGRES_DB`: PostgreSQL database name (default: `postgres`)
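
Because a missing secret only surfaces later as an opaque `SASL authentication failed` or JDBC error, it can help to fail fast at startup. A small sketch that checks the required variables from the list above (the helper name and the choice of which variables are "required" are assumptions):

```python
import os

# Variables without usable defaults, per the credentials list above.
REQUIRED_VARS = [
    "KAFKA_WEB_TRAFFIC_KEY",
    "KAFKA_WEB_TRAFFIC_SECRET",
    "IP_CODING_KEY",
    "KAFKA_URL",
    "POSTGRES_URL",
]

def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")
```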

Quick Install

# Copy environment configuration
cp example.env flink-env.env
# Edit flink-env.env to add your Kafka and IP2Location credentials

# Build the custom Flink image (first build takes 5-30 minutes)
docker compose build

# Start Flink JobManager and TaskManager
docker compose up -d

# Wait for Flink UI at http://localhost:8081
# Look for registration message in logs:
docker compose logs -f taskmanager
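
Instead of watching the browser, readiness can be checked from a script. A minimal sketch that probes the Flink Web UI with the standard library (the retry counts and timeouts are arbitrary choices, not values from this environment):

```python
import time
import urllib.error
import urllib.request

def flink_ready(base_url="http://localhost:8081", timeout=2.0):
    """Return True if the Flink Web UI answers HTTP on base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_for_flink(retries=30, delay=2.0):
    """Poll until the UI is up or retries are exhausted."""
    for _ in range(retries):
        if flink_ready():
            return True
        time.sleep(delay)
    return False
```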

Code Evidence

Dockerfile Python version constraint from `Dockerfile:3-12`:

FROM --platform=linux/amd64 flink:1.16.2

# Python 3.7.9 installation
# Flink 1.16.2 only officially supports Python 3.6, 3.7, 3.8
RUN apt-get update -y && \
    apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev liblzma-dev && \
    wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz && \
    tar -xvf Python-3.7.9.tgz && \
    cd Python-3.7.9 && \
    ./configure --without-tests --enable-shared && \
    make -j6 && \
    make install

Kafka SASL/SSL authentication from `start_job.py:83-113`:

kafka_key = os.environ.get("KAFKA_WEB_TRAFFIC_KEY", "")
kafka_secret = os.environ.get("KAFKA_WEB_TRAFFIC_SECRET", "")

# SASL_SSL properties for Confluent Cloud, set in the Kafka source table's WITH clause:
'properties.security.protocol' = 'SASL_SSL',
'properties.sasl.mechanism' = 'PLAIN',
'properties.sasl.jaas.config' = '...PlainLoginModule required username=... password=...;'
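
The JAAS string that the job interpolates from those environment variables can be assembled as below. This is a sketch assuming the standard Kafka `PlainLoginModule` (Confluent Cloud maps the API key/secret to SASL username/password); the `demo-*` fallback values are illustrative only:

```python
import os

kafka_key = os.environ.get("KAFKA_WEB_TRAFFIC_KEY", "demo-key")
kafka_secret = os.environ.get("KAFKA_WEB_TRAFFIC_SECRET", "demo-secret")

# SASL/PLAIN credentials embedded in the JAAS config string.
jaas_config = (
    "org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="{kafka_key}" password="{kafka_secret}";'
)
```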

TaskManager configuration from `docker-compose.yml:33-54`:

environment:
  - |
    FLINK_PROPERTIES=
    jobmanager.rpc.address: jobmanager
    taskmanager.numberOfTaskSlots: 15
    parallelism.default: 3
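
These two settings bound how many jobs can run side by side. A simplified worked example, assuming the default slot-sharing behavior in which a job needs as many slots as its parallelism:

```python
task_slots = 15          # taskmanager.numberOfTaskSlots
default_parallelism = 3  # parallelism.default

# Under default slot sharing, a job at default parallelism occupies 3 slots,
# so one TaskManager with 15 slots fits up to 5 such jobs concurrently.
max_concurrent_jobs = task_slots // default_parallelism
print(max_concurrent_jobs)  # 5
```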

Common Errors

  • First build takes 5-30 minutes: Python is compiled from source inside Docker. This is normal for the first build; subsequent builds use the cache.
  • `Connection refused` on port 8081: the Flink cluster is not fully initialized. Wait for the TaskManager registration message in the logs.
  • `SASL authentication failed`: invalid Kafka credentials. Verify `KAFKA_WEB_TRAFFIC_KEY` and `KAFKA_WEB_TRAFFIC_SECRET` in `flink-env.env`.
  • `Connection to host.docker.internal refused`: PostgreSQL is not running on the host. Start local PostgreSQL on port 5432 or adjust `POSTGRES_URL`.

Compatibility Notes

  • Platform: The Docker image is built for `linux/amd64` only. Apple Silicon (M1/M2) users require Docker Desktop's Rosetta emulation.
  • Python Version: PyFlink 1.16.2 requires Python 3.6, 3.7, or 3.8. Python 3.9+ is not supported and will cause import failures.
  • Host Networking: Uses `host.docker.internal` to connect from containers to host-side PostgreSQL. This works on Docker Desktop but may require `extra_hosts` configuration on Linux.
  • Credential Security: The `flink-env.env` file contains API secrets and must not be committed to version control.
