Environment:DataExpert_io_Data_engineer_handbook_Flink_Kafka_Docker_Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Stream_Processing |
| Last Updated | 2026-02-09 06:00 GMT |
Overview
Docker Compose environment with Apache Flink 1.16.2 (JobManager + TaskManager), Confluent Cloud Kafka, and PostgreSQL for stream processing exercises.
Description
This environment provides a containerized Apache Flink cluster with a custom-built Docker image (`eczachly-pyflink`) based on `flink:1.16.2`. It includes Python 3.7.9 (required by PyFlink 1.16.2), Java 11, and pre-downloaded connector JARs for Kafka, JDBC, and PostgreSQL. The cluster connects to an external Confluent Cloud Kafka instance for real-time event ingestion and writes processed results to a local PostgreSQL database. The TaskManager is configured with 15 task slots and default parallelism of 3.
Usage
Use this environment for any Flink streaming workflow that requires a local Flink cluster connected to Kafka and PostgreSQL. It is the mandatory prerequisite for running the Create_events_source_kafka, Create_processed_events_sink_postgres, GetLocation_UDF, Log_processing, Log_aggregation, and Tumble_Over_Window implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows with WSL2 | Docker Desktop required |
| Platform | linux/amd64 | Explicitly set in docker-compose; ARM users need emulation |
| Software | Docker Engine + Docker Compose | v2.x recommended |
| Software | PostgreSQL 14 | Local instance on host at port 5432 (or via Docker) |
| Memory | Minimum 4GB allocated to Docker | Flink JobManager + TaskManager |
| Disk | ~2GB | Custom Docker image with Python, Java, and Flink JARs |
| Network | Port 8081 available | Flink Web UI |
Dependencies
System Packages (inside Docker image)
- Apache Flink = 1.16.2 (base image: `flink:1.16.2`)
- Python = 3.7.9 (built from source; PyFlink 1.16.2 requires Python 3.6-3.8)
- Java = OpenJDK 11 (`openjdk-11-jdk`)
- Build tools: `build-essential`, `libssl-dev`, `zlib1g-dev`, `libbz2-dev`, `libffi-dev`, `liblzma-dev`
Python Packages
- `apache-flink` == 1.16.2
- `psycopg2-binary` == 2.9.1
- `requests`
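Because PyFlink 1.16.2 only supports Python 3.6 through 3.8, a small guard like the following (a hypothetical helper, not part of the repo) can fail fast before any job is submitted against an unsupported interpreter:

```python
import sys

def pyflink_1_16_supported(version_info=sys.version_info):
    """Return True if the interpreter falls in the 3.6-3.8 range
    that PyFlink 1.16.x officially supports."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and 6 <= minor <= 8

# The image's Python 3.7.9 passes; Python 3.11 would not.
assert pyflink_1_16_supported((3, 7, 9))
assert not pyflink_1_16_supported((3, 11, 0))
```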
Java Connector JARs
- `flink-python-1.16.2.jar`
- `flink-sql-connector-kafka-1.16.2.jar`
- `flink-connector-jdbc-1.16.2.jar`
- `postgresql-42.2.26.jar`
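In PyFlink, connector JARs like these are typically registered through the `pipeline.jars` configuration key, which takes a semicolon-separated list of `file://` URLs. A minimal sketch (the `/opt/flink/lib` paths are an assumption about where the image places the JARs):

```python
def pipeline_jars_value(jar_paths):
    """Build the value for Flink's 'pipeline.jars' option:
    a semicolon-separated list of file:// URLs."""
    return ";".join(f"file://{p}" for p in jar_paths)

jars = [
    "/opt/flink/lib/flink-sql-connector-kafka-1.16.2.jar",
    "/opt/flink/lib/flink-connector-jdbc-1.16.2.jar",
    "/opt/flink/lib/postgresql-42.2.26.jar",
]
# In a PyFlink job this value would be applied roughly as:
# t_env.get_config().get_configuration().set_string(
#     "pipeline.jars", pipeline_jars_value(jars))
print(pipeline_jars_value(jars))
```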
Credentials
The following environment variables must be set in `flink-env.env` (copy from `example.env`):
- `KAFKA_WEB_TRAFFIC_KEY`: Confluent Cloud API key for Kafka authentication
- `KAFKA_WEB_TRAFFIC_SECRET`: Confluent Cloud API secret for Kafka authentication
- `IP_CODING_KEY`: API key for ip2location.io geolocation service (requires free account at https://www.ip2location.io/)
- `KAFKA_GROUP`: Kafka consumer group name (default: `web-events`)
- `KAFKA_TOPIC`: Kafka topic to consume from (default: `bootcamp-events-prod`)
- `KAFKA_URL`: Kafka bootstrap servers (default: `pkc-rgm37.us-west-2.aws.confluent.cloud:9092`)
- `POSTGRES_URL`: JDBC connection string (default: `jdbc:postgresql://host.docker.internal:5432/postgres`)
- `POSTGRES_USER`: PostgreSQL username (default: `postgres`)
- `POSTGRES_PASSWORD`: PostgreSQL password (default: `postgres`)
- `POSTGRES_DB`: PostgreSQL database name (default: `postgres`)
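The defaults above can be merged with whatever is actually set in the environment. This helper is illustrative rather than taken from the repo; note that the Kafka key and secret have no safe default and are intentionally omitted:

```python
import os

# Defaults mirror the values documented above; a real deployment
# overrides them via flink-env.env.
DEFAULTS = {
    "KAFKA_GROUP": "web-events",
    "KAFKA_TOPIC": "bootcamp-events-prod",
    "KAFKA_URL": "pkc-rgm37.us-west-2.aws.confluent.cloud:9092",
    "POSTGRES_URL": "jdbc:postgresql://host.docker.internal:5432/postgres",
    "POSTGRES_USER": "postgres",
    "POSTGRES_PASSWORD": "postgres",
    "POSTGRES_DB": "postgres",
}

def load_config(env=os.environ):
    """Merge documented defaults with values set in the environment."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}
```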
Quick Install
# Copy environment configuration
cp example.env flink-env.env
# Edit flink-env.env to add your Kafka and IP2Location credentials
# Build the custom Flink image (first build takes 5-30 minutes)
docker compose build
# Start Flink JobManager and TaskManager
docker compose up -d
# Wait for Flink UI at http://localhost:8081
# Look for registration message in logs:
docker compose logs -f taskmanager
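Instead of tailing logs, readiness can also be checked against Flink's REST API: `GET http://localhost:8081/overview` returns cluster statistics including a `taskmanagers` count and total slots. A sketch of the check (the live HTTP call is commented out so the helper stays self-contained):

```python
def cluster_ready(overview, min_taskmanagers=1, min_slots=1):
    """Decide readiness from the JSON body of Flink's GET /overview endpoint."""
    return (overview.get("taskmanagers", 0) >= min_taskmanagers
            and overview.get("slots-total", 0) >= min_slots)

# Usage against a live cluster (requires the 'requests' package):
# import requests
# overview = requests.get("http://localhost:8081/overview", timeout=5).json()
# print(cluster_ready(overview))
```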
Code Evidence
Dockerfile Python version constraint from `Dockerfile:3-12`:
FROM --platform=linux/amd64 flink:1.16.2
# Python 3.7.9 installation
# Flink 1.16.2 only officially supports Python 3.6, 3.7, 3.8
RUN apt-get update -y && \
apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev liblzma-dev && \
wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz && \
tar -xvf Python-3.7.9.tgz && \
cd Python-3.7.9 && \
./configure --without-tests --enable-shared && \
make -j6 && \
make install
Kafka SASL/SSL authentication from `start_job.py:83-113`:
kafka_key = os.environ.get("KAFKA_WEB_TRAFFIC_KEY", "")
kafka_secret = os.environ.get("KAFKA_WEB_TRAFFIC_SECRET", "")
# SASL_SSL for Confluent Cloud
'properties.security.protocol' = 'SASL_SSL',
'properties.sasl.mechanism' = 'PLAIN',
'properties.sasl.jaas.config' = '...PlainLoginModule required username=... password=...;'
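The elided JAAS string above follows the standard SASL/PLAIN pattern Confluent Cloud expects; the repo's exact string may differ, so treat this helper as a hedged sketch of that pattern rather than the repo's code:

```python
def jaas_config(key, secret):
    """Build a SASL/PLAIN JAAS config string in the form
    Kafka's PlainLoginModule expects."""
    return (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{key}" password="{secret}";'
    )
```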
TaskManager configuration from `docker-compose.yml:33-54`:
environment:
- |
FLINK_PROPERTIES=
jobmanager.rpc.address: jobmanager
taskmanager.numberOfTaskSlots: 15
parallelism.default: 3
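With 15 slots and a default parallelism of 3, each job occupies 3 slots (one per parallel subtask, assuming default slot sharing), so up to five jobs can run concurrently. The arithmetic:

```python
def max_concurrent_jobs(total_slots, parallelism):
    """With default slot sharing, a Flink job at a given parallelism
    occupies that many task slots, one per parallel subtask."""
    return total_slots // parallelism

assert max_concurrent_jobs(15, 3) == 5
```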
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| Build appears to hang (5-30 minutes) | Python is compiled from source inside the image | Normal on the first build; subsequent builds use the Docker cache |
| `Connection refused` on port 8081 | Flink cluster not fully initialized | Wait for TaskManager registration message in logs |
| `SASL authentication failed` | Invalid Kafka credentials | Verify `KAFKA_WEB_TRAFFIC_KEY` and `KAFKA_WEB_TRAFFIC_SECRET` in `flink-env.env` |
| `Connection to host.docker.internal refused` | PostgreSQL not running on host | Start local PostgreSQL on port 5432 or adjust `POSTGRES_URL` |
Compatibility Notes
- Platform: The Docker image is built for `linux/amd64` only. Apple Silicon (M1/M2) users require Docker Desktop's Rosetta emulation.
- Python Version: PyFlink 1.16.2 requires Python 3.6, 3.7, or 3.8. Python 3.9+ is not supported and will cause import failures.
- Host Networking: Uses `host.docker.internal` to connect from containers to host-side PostgreSQL. This works on Docker Desktop but may require `extra_hosts` configuration on Linux.
- Credential Security: The `flink-env.env` file contains API secrets and must not be committed to version control.
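Regarding the host-networking note above: on Linux, `host.docker.internal` is not defined by default. A common workaround (an assumption to verify against your Docker version; `host-gateway` support arrived in Docker 20.10) is to map it to the host gateway in `docker-compose.yml`:

```yaml
services:
  taskmanager:
    extra_hosts:
      # Makes host.docker.internal resolve to the host on Linux
      - "host.docker.internal:host-gateway"
```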
Related Pages
- Implementation:DataExpert_io_Data_engineer_handbook_Create_events_source_kafka
- Implementation:DataExpert_io_Data_engineer_handbook_Create_processed_events_sink_postgres
- Implementation:DataExpert_io_Data_engineer_handbook_GetLocation_UDF
- Implementation:DataExpert_io_Data_engineer_handbook_Log_processing
- Implementation:DataExpert_io_Data_engineer_handbook_Log_aggregation
- Implementation:DataExpert_io_Data_engineer_handbook_Tumble_Over_Window