Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:DataTalksClub Data engineering zoomcamp Docker PostgreSQL Data Ingestion

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Docker, Data_Ingestion
Last Updated 2026-02-09 07:00 GMT

Overview

End-to-end process for containerizing a Python data ingestion pipeline that downloads NYC taxi trip data and loads it into a Dockerized PostgreSQL database.

Description

This workflow demonstrates the foundational data engineering pattern of building a reproducible, containerized data pipeline. It uses Docker to run PostgreSQL and pgAdmin as services, then executes a Python script that downloads NYC taxi CSV data from a remote source, reads it in chunks using pandas, and bulk-loads it into a PostgreSQL table using SQLAlchemy. The pipeline handles schema creation, chunked iteration for memory efficiency, and progress tracking.

Usage

Execute this workflow when you need to set up a local development environment for exploring and querying NYC taxi trip data. It is the starting point for the Data Engineering Zoomcamp and assumes no prior infrastructure. You have access to Docker and want to ingest publicly available CSV data into a relational database for SQL-based analysis.

Execution Steps

Step 1: Environment Setup

Launch PostgreSQL and pgAdmin as Docker containers using Docker Compose. This creates a persistent database service with a web-based admin interface, connected through a shared Docker network so that both containers can communicate.

Key considerations:

  • Map a local volume to persist PostgreSQL data across container restarts
  • Expose ports for external client access (5432 for PostgreSQL, 8080 for pgAdmin)
  • Configure environment variables for database credentials

Step 2: Data Source Configuration

Configure the data source URL and target table parameters. The pipeline constructs download URLs based on taxi color (yellow/green), year, and month parameters, pointing to the DataTalksClub NYC TLC data releases on GitHub.

Key considerations:

  • URLs follow a predictable pattern based on taxi type and date
  • Data is served as gzipped CSV files
  • Column dtypes must be explicitly defined to handle nullable integer fields

Step 3: Database Connection

Establish a connection to the PostgreSQL database using SQLAlchemy with the psycopg driver. The connection string is assembled from CLI parameters (host, port, user, password, database name).

Key considerations:

  • Use SQLAlchemy engine for pandas integration
  • Connection parameters are passed via CLI arguments using Click
  • The engine handles connection pooling automatically

Step 4: Chunked Data Ingestion

Read the remote CSV file in configurable chunks (default 100,000 rows) using pandas iterator mode. For the first chunk, create the target table schema by writing an empty DataFrame. Then append each subsequent chunk to the table, tracking progress with tqdm.

Key considerations:

  • Iterator mode prevents loading the entire file into memory
  • First chunk uses if_exists='replace' to create/reset the table schema
  • Subsequent chunks use if_exists='append' for incremental loading
  • Date columns are parsed during read for proper timestamp types

Step 5: Pipeline Containerization

Package the ingestion script into its own Docker image with all dependencies (pandas, SQLAlchemy, Click, tqdm). Run the containerized pipeline on the same Docker network as PostgreSQL so it can connect via the container hostname.

Key considerations:

  • The ingestion container must be on the same Docker network as PostgreSQL
  • Use the PostgreSQL container name as the hostname (not localhost)
  • All CLI parameters can be overridden at container runtime

Execution Diagram

GitHub URL

Workflow Repository