Principle:DataExpert io Data engineer handbook Docker Environment Setup

From Leeroopedia


Overview

Docker Environment Setup encompasses the theory and best practices for creating reproducible, containerized development environments using Docker and Docker Compose. In the context of the Dimensional Data Modeling workflow, this principle covers how PostgreSQL databases, administrative tools, and initialization scripts are orchestrated within containers to provide a consistent environment for data engineering exercises.

Theoretical Foundation

Containerization for Data Engineering

Containers solve the "works on my machine" problem by packaging an application and all its dependencies into a single, portable unit. For data engineering workflows, containerization provides:

  • Environment consistency -- Every engineer works against the same PostgreSQL version, configuration, and seed data.
  • Isolation -- The containerized database does not interfere with other databases or services on the host machine.
  • Reproducibility -- Destroying and recreating the environment yields an identical starting state every time.
  • Portability -- The same Docker Compose file can be used on Linux, macOS, and Windows hosts that have Docker installed.

Docker Compose for Multi-Service Orchestration

Docker Compose allows multiple interrelated services to be defined, configured, and launched from a single YAML file. In a typical data engineering development stack, these services include:

  • Database service -- A PostgreSQL instance that stores the working dataset.
  • Administration service -- A tool such as pgAdmin that provides a graphical interface for inspecting and querying the database.
  • Initialization service -- Scripts or sidecar containers that seed the database on first startup.

Docker Compose handles:

  • Service dependencies -- Ensuring the database is healthy before dependent services attempt to connect.
  • Network creation -- Automatically creating a shared network so services can communicate by name.
  • Lifecycle management -- Starting, stopping, and destroying the entire stack with single commands.

docker compose up        --> Create and start all services
docker compose down      --> Stop and remove the containers and the default network
docker compose down -v   --> Also remove the named volumes declared in the Compose file
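
The dependency and networking behavior described above could be declared roughly as follows. This is a minimal sketch, not the handbook's actual Compose file: the image tags, credentials, and healthcheck timings are assumptions chosen for illustration. It is written as a shell heredoc so that it can be pasted into a terminal without Docker being installed.

```shell
# Write a minimal, hypothetical Compose file into a scratch directory.
demo_dir=$(mktemp -d)
cat > "$demo_dir/docker-compose.yml" <<'EOF'
services:
  postgres:
    image: postgres:16              # assumed tag; pin whichever version the exercises use
    environment:
      POSTGRES_PASSWORD: postgres   # placeholder credential
    healthcheck:
      # pg_isready exits 0 once the server accepts connections
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: admin
    depends_on:
      postgres:
        condition: service_healthy  # start only after the healthcheck passes
EOF
echo "wrote $demo_dir/docker-compose.yml"

# With Docker installed, the stack could then be started from that directory:
#   (cd "$demo_dir" && docker compose up -d)
```

Because both services join the network that Compose creates for the project, pgAdmin would reach the database at hostname postgres on port 5432, regardless of which host port is published.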

Volume Mounting for Data Persistence

Docker volumes provide a mechanism for persisting data beyond the lifecycle of a container. There are two primary volume strategies relevant to data engineering:

Named Volumes

Named volumes are managed by Docker and persist independently of any container. They are ideal for:

  • Database storage -- PostgreSQL's data directory (/var/lib/postgresql/data) is stored in a named volume so that data survives when the container is removed and recreated, not just restarted.
  • Shared state -- Multiple containers can mount the same named volume to share data.

Bind Mounts

Bind mounts map a specific host directory to a container path. They are ideal for:

  • Initialization scripts -- Mounting the host's scripts/ directory into the container's /docker-entrypoint-initdb.d/ so that seeding scripts and dump files are available at startup.
  • Homework files -- Mounting SQL exercise files into the container so they can be executed during initialization.
  • Configuration files -- Mounting custom PostgreSQL configuration files without rebuilding the image.

| Volume Type  | Managed By      | Persists After docker compose down | Use Case                        |
|--------------|-----------------|------------------------------------|---------------------------------|
| Named Volume | Docker          | Yes (unless -v flag used)          | Database data storage           |
| Bind Mount   | Host filesystem | Always (host files)                | Scripts, config, homework files |
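
The two strategies can appear side by side on the same service. The fragment below is a sketch under assumptions: the volume name postgres-data, the image tag, and the placeholder password are illustrative, while the two container paths are the ones named in the text above.

```shell
# Write a hypothetical Compose fragment that uses both volume strategies.
vol_dir=$(mktemp -d)
cat > "$vol_dir/docker-compose.yml" <<'EOF'
services:
  postgres:
    image: postgres:16              # assumed tag
    environment:
      POSTGRES_PASSWORD: postgres   # placeholder credential
    volumes:
      # Named volume: Docker-managed storage for the data directory
      - postgres-data:/var/lib/postgresql/data
      # Bind mount: the host's ./scripts directory, visible to the entrypoint
      - ./scripts:/docker-entrypoint-initdb.d
volumes:
  postgres-data:                    # declared here so Compose creates and tracks it
EOF

# Show which mount lines ended up in the file.
grep -n -e 'postgres-data:/' -e './scripts:' "$vol_dir/docker-compose.yml"
```

The official postgres image runs the scripts in /docker-entrypoint-initdb.d only when the data directory is empty, so after changing a seed script the named volume generally has to be removed (docker compose down -v) before the scripts run again.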

Environment Variable Injection via .env Files

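Environment variables decouple configuration from code. Docker Compose automatically loads a .env file placed alongside docker-compose.yml and makes its variables available for interpolation in the Compose file (for example ${HOST_PORT}). The values reach a container only where the Compose file passes them on, for instance through a service's environment or env_file entry.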

This approach provides several benefits:

  • Security -- Sensitive values (passwords, API keys) are kept out of version-controlled files. The .env file can be added to .gitignore.
  • Flexibility -- Different environments (development, testing, CI) can use different .env files without modifying the Compose file.
  • Clarity -- All configurable parameters are centralized in one file, making it easy to understand what can be customized.

Common environment variables in the Dimensional Data Modeling stack include:

| Variable          | Purpose                                   | Example Value   |
|-------------------|-------------------------------------------|-----------------|
| POSTGRES_USER     | Database superuser name                   | postgres        |
| POSTGRES_PASSWORD | Database superuser password               | postgres        |
| POSTGRES_DB       | Default database name                     | postgres        |
| HOST_PORT         | Host port mapped to PostgreSQL's port 5432 | 5433           |
| PGADMIN_EMAIL     | pgAdmin login email                       | admin@admin.com |
| PGADMIN_PASSWORD  | pgAdmin login password                    | admin           |
| PGADMIN_PORT      | Host port mapped to pgAdmin's port 80     | 5050            |
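
As a rough illustration, the example values in the table can be written to a .env file and previewed from a shell. Compose parses .env on its own; sourcing the file here is only a stand-in that works because every entry is a plain KEY=value line. The scratch directory and the two *_ports variable names are assumptions made for this sketch.

```shell
# Create a .env file holding the example values from the table.
env_dir=$(mktemp -d)
cat > "$env_dir/.env" <<'EOF'
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=postgres
HOST_PORT=5433
PGADMIN_EMAIL=admin@admin.com
PGADMIN_PASSWORD=admin
PGADMIN_PORT=5050
EOF

# Preview the values in the current shell (a stand-in for Compose's own parsing).
set -a              # export every variable assigned while sourcing
. "$env_dir/.env"
set +a

# A Compose file would reference the values as "${HOST_PORT}:5432" and so on.
postgres_ports="${HOST_PORT}:5432"
pgadmin_ports="${PGADMIN_PORT}:80"
echo "postgres ports: $postgres_ports"
echo "pgadmin ports:  $pgadmin_ports"
```

With Docker installed, running docker compose config in the project directory prints the Compose file with all variables resolved, which is a quick way to confirm that the .env file was picked up.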

Design Principles

Least Privilege

Containers should run with the minimum permissions necessary. The --no-owner and --no-privileges flags used in the database restoration scripts support this principle: they skip the ownership and GRANT/REVOKE statements recorded in the dump, so the restore neither depends on nor has to create roles that do not exist in the container environment.
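
A restore step using these flags might look like the sketch below. It assumes the flags are passed to pg_restore; the helper name restore_dump, the DRY_RUN switch, and the dump path are hypothetical and exist only so that the command can be inspected without a running database.

```shell
# Hypothetical helper: restore a dump without ownership or privilege statements.
restore_dump() {
  dump_file=$1
  # --no-owner:      do not emit commands that set object ownership
  # --no-privileges: do not emit GRANT/REVOKE commands
  set -- pg_restore --no-owner --no-privileges \
    -U "${POSTGRES_USER:-postgres}" -d "${POSTGRES_DB:-postgres}" "$dump_file"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"       # print the command instead of running it
  else
    "$@"
  fi
}

# Dry run: show the command that would be executed (the dump path is illustrative).
restore_cmd=$(DRY_RUN=1 restore_dump /docker-entrypoint-initdb.d/data.dump)
echo "$restore_cmd"
```

Restored objects end up owned by the connecting user, which is usually what a single-user exercise database needs.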

Declarative Configuration

The entire environment is declared in configuration files (docker-compose.yml, .env) rather than being imperatively constructed through a series of manual commands. This makes the environment self-documenting and version-controllable.

Fail-Fast Initialization

Initialization scripts use set -e to exit immediately on error. This ensures that a failed seeding process does not result in a partially initialized database that could produce confusing errors during exercises.
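
The effect of set -e can be seen with a small stand-in script. This is a sketch, not the handbook's init script: the file name init-db.sh and the step messages are made up, and false takes the place of a failing psql or pg_restore call.

```shell
# Write a hypothetical init script whose second command fails.
seed_dir=$(mktemp -d)
cat > "$seed_dir/init-db.sh" <<'EOF'
#!/bin/sh
set -e                          # abort on the first command that fails
echo "step 1: create schema"
false                           # stands in for a failing restore command
echo "step 2: load data"        # never reached because of set -e
EOF

# Run it in a separate shell process and record how it ended.
status=0
output=$(sh "$seed_dir/init-db.sh") || status=$?
echo "$output"
echo "exit status: $status"
```

The script stops after step 1 and returns a non-zero status, so the caller sees the failure immediately instead of continuing against a half-seeded database.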

Separation of Concerns

Each service in the Docker Compose stack has a single responsibility:

  • The postgres service runs the database.
  • The pgadmin service provides the administrative interface.
  • The init scripts handle data seeding.

This separation makes it easy to modify, replace, or debug individual components without affecting the others.
