Implementation:DataTalksClub Data engineering zoomcamp Docker Build Run

From Leeroopedia


Metadata
Knowledge Sources: DataTalksClub/data-engineering-zoomcamp
Domains: Docker, Container Build, Container Networking, Pipeline Execution
Last Updated: 2026-02-09 14:00 GMT

Overview

Concrete tool for building the taxi data ingestion pipeline into a Docker image and running it on a shared Docker network to communicate with the PostgreSQL container.

Description

This implementation consists of two steps: building the Docker image using docker build and running the container using docker run with appropriate network and parameter flags.

The Dockerfile is a single-stage build (one FROM instruction) that pulls the uv binary from a separate image with COPY --from:

  1. Starts from python:3.13.11-slim as the base image.
  2. Copies the uv package manager from its official image.
  3. Copies the dependency manifest (pyproject.toml, .python-version, uv.lock) and installs locked dependencies via uv sync --locked.
  4. Copies the pipeline script (ingest_data.py).
  5. Sets the entry point to python ingest_data.py.

The run command (from the helper script docker-ingest.sh) attaches the container to the pg-network Docker network and passes all pipeline CLI parameters. The --pg-host parameter must be pgdatabase (the Compose service name) rather than localhost: the pipeline and database containers have separate network namespaces, so localhost inside the pipeline container refers to the pipeline container itself. Because both containers share a user-defined bridge network, Docker's embedded DNS resolves pgdatabase to the database container.
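A quick way to confirm that pgdatabase is reachable by name is to resolve it from a throwaway container on the same network. The sketch below assumes the image has already been built as taxi_ingest:v001 and reuses its Python interpreter by overriding the entry point. The function name check_pg_host and the DOCKER override variable are hypothetical names introduced for illustration, not part of the course repository.

```shell
#!/usr/bin/env bash
# Sketch: resolve a host name from inside a Docker network.
# Set DOCKER=echo to print the command instead of executing it (dry run).

check_pg_host() {
  # Defaults mirror the values used in docker-ingest.sh.
  local network="${1:-pg-network}" host="${2:-pgdatabase}"

  # --entrypoint replaces the image's "python ingest_data.py" entry point,
  # so the arguments after the image name go straight to python.
  "${DOCKER:-docker}" run --rm \
    --network="$network" \
    --entrypoint python \
    taxi_ingest:v001 \
    -c "import socket; print(socket.gethostbyname('$host'))"
}

# Example (requires Docker and a running database container):
#   check_pg_host pg-network pgdatabase
```

If the name resolves, the command prints the database container's IP address on that network; if it does not, Python raises an error and the command exits with a non-zero status, which usually means the two containers are not on the same network.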

Usage

Use this implementation after the Docker Compose stack (PostgreSQL + pgAdmin) is running. Build the image once, then run it with different year/month parameters to ingest different data partitions. The container is ephemeral: the --rm flag makes Docker remove it as soon as the pipeline exits.

Code Reference

Source Location: 01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-ingest.sh:L1-17

Dockerfile: 01-docker-terraform/docker-sql/pipeline/Dockerfile

Dockerfile Signature:

FROM python:3.13.11-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

WORKDIR /code
ENV PATH="/code/.venv/bin:$PATH"

COPY pyproject.toml .python-version uv.lock ./
RUN uv sync --locked

COPY ingest_data.py .

ENTRYPOINT ["python", "ingest_data.py"]

Run Script Signature:

#!/usr/bin/env bash

## bash script to run the ingestion container
echo "Running data ingestion for January 2021..."

docker run -it --rm \
  --network=pg-network \
  taxi_ingest:v001 \
  --year=2021 \
  --month=1 \
  --pg-user=root \
  --pg-pass=root \
  --pg-host=pgdatabase \
  --pg-port=5432 \
  --pg-db=ny_taxi \
  --chunksize=100000 \
  --target-table=yellow_taxi_trips

Import: N/A (external tool, requires Docker Engine on the host)

I/O Contract

Inputs

Name | Type | Description
Dockerfile | File | Dockerfile in the pipeline directory defining the image build steps
pyproject.toml, uv.lock, .python-version | Files | Dependency manifest and lock file for reproducible installs
ingest_data.py | File | The pipeline script to be packaged into the image
Docker Engine | Runtime | A running Docker daemon on the host machine
pg-network | Docker Network | A Docker bridge network that the PostgreSQL container is also attached to
CLI parameters | Various | --year, --month, --pg-user, --pg-pass, --pg-host, --pg-port, --pg-db, --chunksize, --target-table

Outputs

Name | Type | Description
taxi_ingest:v001 | Docker Image | Built image containing the pipeline code and all dependencies (output of docker build)
Data in PostgreSQL | Database rows | The specified monthly taxi data loaded into the target table in the ny_taxi database (output of docker run)

Usage Examples

Building the Docker image:

cd 01-docker-terraform/docker-sql/pipeline

# Build the image and tag it
docker build -t taxi_ingest:v001 .

Creating the Docker network (if not using the Compose default network):

# Create a custom bridge network
docker network create pg-network

# If using docker-compose, the network is created automatically
# with the name <project>_default; pass that name to --network
# instead of pg-network
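Running docker network create against a network that already exists fails, so scripts often guard the call. The sketch below shows one way to do that, plus a helper that guesses the Compose default network name from a project directory. Both function names are hypothetical, and the name normalization (lowercase, then drop characters outside a-z, 0-9, _ and -) is an approximation of Compose's behavior; it also assumes the project name has not been overridden with -p, COMPOSE_PROJECT_NAME, or a top-level name: key.

```shell
#!/usr/bin/env bash
# Sketch: idempotent network setup and a best-effort Compose network name.

# Create the network only if it does not exist yet, so re-running is safe.
ensure_network() {
  local name="${1:-pg-network}"
  docker network inspect "$name" >/dev/null 2>&1 || docker network create "$name"
}

# Guess "<project>_default" from the directory that holds the Compose file.
# Approximation: assumes the project name is the normalized directory name.
compose_default_network() {
  local project
  project="$(basename "$1" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9_-')"
  printf '%s_default\n' "$project"
}

# Example (requires Docker):
#   ensure_network pg-network
#   docker run --rm --network="$(compose_default_network "$PWD")" taxi_ingest:v001 ...
```

When in doubt, docker network ls shows the actual network names on the host and is more reliable than guessing.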

Running the ingestion for January 2021:

docker run -it --rm \
  --network=pg-network \
  taxi_ingest:v001 \
  --year=2021 \
  --month=1 \
  --pg-user=root \
  --pg-pass=root \
  --pg-host=pgdatabase \
  --pg-port=5432 \
  --pg-db=ny_taxi \
  --chunksize=100000 \
  --target-table=yellow_taxi_trips

Running the ingestion for a different month:

docker run -it --rm \
  --network=pg-network \
  taxi_ingest:v001 \
  --year=2021 \
  --month=7 \
  --pg-user=root \
  --pg-pass=root \
  --pg-host=pgdatabase \
  --pg-port=5432 \
  --pg-db=ny_taxi \
  --target-table=yellow_taxi_trips_july
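Since only the year, month, and target table change between runs, the command can be wrapped in a small loop. The following is a minimal sketch rather than part of the course repository: ingest_month, ingest_range, the DOCKER override variable, and the per-month table naming scheme (yellow_taxi_trips_YYYY_MM) are all assumptions made for illustration. The -it flags are dropped because a loop may run without a terminal attached.

```shell
#!/usr/bin/env bash
# Sketch: ingest a range of months with the same image.
# Set DOCKER=echo to print each command instead of executing it (dry run).

ingest_month() {
  local year="$1"
  local month=$((10#$2))   # force base 10 so inputs like "08" are accepted
  local table
  # Hypothetical naming scheme: one table per monthly partition.
  table="$(printf 'yellow_taxi_trips_%04d_%02d' "$year" "$month")"

  "${DOCKER:-docker}" run --rm \
    --network=pg-network \
    taxi_ingest:v001 \
    --year="$year" \
    --month="$month" \
    --pg-user=root \
    --pg-pass=root \
    --pg-host=pgdatabase \
    --pg-port=5432 \
    --pg-db=ny_taxi \
    --chunksize=100000 \
    --target-table="$table"
}

# Ingest months first..last of one year, stopping at the first failure.
ingest_range() {
  local year="$1" first="$2" last="$3" month
  for ((month = first; month <= last; month++)); do
    ingest_month "$year" "$month" || return 1
  done
}

# Example (requires Docker and the running Compose stack):
#   ingest_range 2021 1 3
```

Running with DOCKER=echo first is a cheap way to review the generated commands before any data is loaded.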
