Implementation:DataTalksClub Data engineering zoomcamp Docker Build Run
| Metadata | |
|---|---|
| Knowledge Sources | DataTalksClub/data-engineering-zoomcamp |
| Domains | Docker, Container Build, Container Networking, Pipeline Execution |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for packaging the taxi data ingestion pipeline as a Docker image and running it on a shared Docker network so that it can reach the PostgreSQL container.
Description
This implementation consists of two steps: building the Docker image using docker build and running the container using docker run with appropriate network and parameter flags.
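The Dockerfile has a single build stage and borrows one binary from another image through COPY --from, the mechanism used in multi-stage builds:
- Starts from python:3.13.11-slim as the base image.
- Copies the uv package manager binary from its official image (ghcr.io/astral-sh/uv:latest).
- Sets the working directory to /code and prepends /code/.venv/bin to PATH, so that python resolves to the virtual environment created by uv.
- Copies the dependency manifest files (pyproject.toml, .python-version, uv.lock) and installs locked dependencies via uv sync --locked.
- Copies the pipeline script (ingest_data.py).
- Sets the entry point to python ingest_data.py.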
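The run command (from the helper script docker-ingest.sh) joins the container to the pg-network Docker network and passes all pipeline CLI parameters. Critically, the --pg-host parameter is set to pgdatabase (the Compose service name) rather than localhost: the pipeline container and the database container have separate network namespaces, so localhost inside the pipeline container refers to the pipeline container itself. Because both containers are attached to the same user-defined Docker bridge network, Docker's built-in DNS resolves pgdatabase to the database container.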
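To make the host choice above concrete, the sketch below assembles a PostgreSQL connection URL from the same parameter values. It assumes the pipeline combines its --pg-* options into a standard postgresql://user:pass@host:port/db URL; ingest_data.py is not shown on this page, so that format is an assumption, and pg_url is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: shows where the --pg-host value ends up in a
# standard PostgreSQL connection URL (assumed format, not taken from
# ingest_data.py).
pg_url() {
  local user="$1" pass="$2" host="$3" port="$4" db="$5"
  printf 'postgresql://%s:%s@%s:%s/%s\n' "$user" "$pass" "$host" "$port" "$db"
}

# Inside the pipeline container, localhost is the container itself,
# so the database is addressed by its name on the shared network.
pg_url root root pgdatabase 5432 ny_taxi

# From the host machine (assuming the database port is published as 5432),
# localhost would be the host to use instead.
pg_url root root localhost 5432 ny_taxi
```

Only the host segment differs between the two URLs, which is why the same image can be pointed at the database from either side by changing --pg-host alone.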
Usage
Use this implementation after the Docker Compose stack (PostgreSQL + pgAdmin) is running. Build the image once, then run it with different year/month parameters to ingest different data partitions. The container is ephemeral (--rm flag) and removes itself after completion.
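The build-once, run-many pattern described above can be sketched as a small wrapper that varies only the year, month, and target table. The names run_ingest and DRY_RUN are hypothetical and not part of the course repository; the docker run arguments mirror docker-ingest.sh. The -it flags are left out on the assumption that the loop may run without a terminal attached. By default the sketch only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of a per-partition wrapper. run_ingest and DRY_RUN are hypothetical
# names; the docker run arguments mirror docker-ingest.sh.
set -eu

run_ingest() {
  local year="$1" month="$2" table="${3:-yellow_taxi_trips}"
  # Reuse the function's positional parameters as the command line.
  set -- docker run --rm \
    --network=pg-network \
    taxi_ingest:v001 \
    "--year=${year}" \
    "--month=${month}" \
    --pg-user=root \
    --pg-pass=root \
    --pg-host=pgdatabase \
    --pg-port=5432 \
    --pg-db=ny_taxi \
    --chunksize=100000 \
    "--target-table=${table}"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"   # print the command instead of running it
  else
    "$@"        # run the container
  fi
}

# Print the commands for January through March 2021.
for month in 1 2 3; do
  run_ingest 2021 "$month"
done
```

Setting DRY_RUN=0 in the environment would execute the containers one after another instead of printing the commands.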
Code Reference
Source Location: 01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-ingest.sh:L1-17
Dockerfile: 01-docker-terraform/docker-sql/pipeline/Dockerfile
Dockerfile Signature:
FROM python:3.13.11-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/
WORKDIR /code
ENV PATH="/code/.venv/bin:$PATH"
COPY pyproject.toml .python-version uv.lock ./
RUN uv sync --locked
COPY ingest_data.py .
ENTRYPOINT ["python", "ingest_data.py"]
Run Script Signature:
#!/usr/bin/env bash
## bash script to run the ingestion container
echo "Running data ingestion for January 2021..."
docker run -it --rm \
--network=pg-network \
taxi_ingest:v001 \
--year=2021 \
--month=1 \
--pg-user=root \
--pg-pass=root \
--pg-host=pgdatabase \
--pg-port=5432 \
--pg-db=ny_taxi \
--chunksize=100000 \
--target-table=yellow_taxi_trips
Import: N/A (external tool, requires Docker Engine on the host)
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| Dockerfile | File | Dockerfile in the pipeline directory defining the image build steps |
| pyproject.toml, uv.lock, .python-version | Files | Dependency manifest and lock file for reproducible installs |
| ingest_data.py | File | The pipeline script to be packaged into the image |
| Docker Engine | Runtime | A running Docker daemon on the host machine |
| pg-network | Docker Network | A Docker bridge network that the PostgreSQL container is also attached to |
| CLI parameters | Various | --year, --month, --pg-user, --pg-pass, --pg-host, --pg-port, --pg-db, --chunksize, --target-table |
Outputs
| Name | Type | Description |
|---|---|---|
| taxi_ingest:v001 | Docker Image | Built image containing the pipeline code and all dependencies (output of docker build) |
| Data in PostgreSQL | Database rows | The specified monthly taxi data loaded into the target table in the ny_taxi database (output of docker run) |
Usage Examples
Building the Docker image:
cd 01-docker-terraform/docker-sql/pipeline
# Build the image and tag it
docker build -t taxi_ingest:v001 .
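Before building, it can help to confirm that the build context contains every file the Dockerfile copies. The check below is a minimal sketch; missing_build_inputs is a hypothetical helper, and the file list is taken from the Inputs table and the Dockerfile on this page.

```shell
#!/usr/bin/env bash
# Hypothetical pre-build check: list the expected build inputs that are
# absent from the given build context directory (default: current directory).
missing_build_inputs() {
  local dir="${1:-.}" f
  for f in Dockerfile pyproject.toml .python-version uv.lock ingest_data.py; do
    [ -f "${dir}/${f}" ] || echo "$f"
  done
}

# Report on the current directory; docker build would follow only when
# nothing is missing.
missing="$(missing_build_inputs .)"
if [ -z "$missing" ]; then
  echo "build context looks complete"
else
  echo "missing build inputs:" $missing
fi
```

A missing uv.lock or pyproject.toml would otherwise surface only as a failed COPY step partway through the build.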
Creating the Docker network (if not using the Compose default network):
# Create a custom bridge network
docker network create pg-network
# If using docker-compose without an explicit network definition, a network
# named <project>_default is created automatically; pass that name to --network
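Running docker network create a second time fails because the name is already taken, so a helper script may prefer to create the network only when it is absent. The sketch below shows one way to do that; ensure_network and the DOCKER override variable are hypothetical conveniences, with DOCKER defaulting to the real docker CLI.

```shell
#!/usr/bin/env bash
# Sketch: create the network only when it does not exist yet.
# ensure_network and the DOCKER override are hypothetical conveniences;
# DOCKER defaults to the real docker CLI.
DOCKER="${DOCKER:-docker}"

ensure_network() {
  local name="$1"
  # "docker network inspect" exits non-zero when the network is unknown.
  if "$DOCKER" network inspect "$name" >/dev/null 2>&1; then
    echo "network ${name} already exists"
  else
    "$DOCKER" network create "$name" >/dev/null
    echo "network ${name} created"
  fi
}

# Usage (needs a running Docker daemon):
#   ensure_network pg-network
```

The DOCKER variable exists only so the function can be exercised with a stand-in command where no Docker daemon is available.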
Running the ingestion for January 2021:
docker run -it --rm \
--network=pg-network \
taxi_ingest:v001 \
--year=2021 \
--month=1 \
--pg-user=root \
--pg-pass=root \
--pg-host=pgdatabase \
--pg-port=5432 \
--pg-db=ny_taxi \
--chunksize=100000 \
--target-table=yellow_taxi_trips
Running the ingestion for a different month:
docker run -it --rm \
--network=pg-network \
taxi_ingest:v001 \
--year=2021 \
--month=7 \
--pg-user=root \
--pg-pass=root \
--pg-host=pgdatabase \
--pg-port=5432 \
--pg-db=ny_taxi \
--target-table=yellow_taxi_trips_july
Related Pages
- Principle:DataTalksClub_Data_engineering_zoomcamp_Pipeline_Containerization
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Docker_Compose_PostgreSQL_Setup
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Pandas_Chunked_CSV_Loading
- Environment:DataTalksClub_Data_engineering_zoomcamp_Docker_PostgreSQL_Python_Environment