
Workflow:DataTalksClub Data engineering zoomcamp dlt Data Ingestion

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Data_Ingestion, ELT
Last Updated 2026-02-09 07:00 GMT

Overview

End-to-end data ingestion pipeline using dlt (data load tool) to dynamically load NYC taxi trip Parquet data into BigQuery, supporting both direct web-to-warehouse and GCS-staged loading paths.

Description

This workflow demonstrates modern data ingestion using dlt, a Python-first data pipeline library. It generates download URLs for NYC taxi trip Parquet files based on user-specified date ranges and taxi colors, then offers two loading strategies: (1) download to Google Cloud Storage first, then load from GCS to BigQuery using dlt's filesystem source with Parquet reader; or (2) stream Parquet files directly from the web to BigQuery in memory-efficient 1MB chunks. Both paths use dlt's pipeline abstraction with replace write disposition for full refreshes and automatic schema inference from Parquet metadata.

Usage

Execute this workflow when you need to bulk-load NYC taxi trip data into BigQuery for analysis. Use the GCS-staged path when you want to retain raw files in cloud storage for auditing or reprocessing. Use the direct web path when you want the simplest possible setup without intermediate storage. Requires GCP credentials configured in a TOML secrets file.

Execution Steps

Step 1: Credentials Configuration

Load GCP service account credentials from a TOML secrets file (.dlt/secrets.toml) and set them as environment variables. The credentials include project ID, private key, and client email, which dlt and the GCS client use for authentication.

Key considerations:

  • Secrets file must follow the dlt TOML format with a [credentials] section
  • Environment variables are set programmatically for dlt's credential resolution
  • A separate GCS JSON key file is needed for the GCS-staged loading path

Step 2: URL Generation

Construct download URLs for NYC taxi trip Parquet files based on user-specified parameters: taxi color (green/yellow), start year, end year, start month, and end month. URLs point to the NYC TLC CloudFront distribution.

Key considerations:

  • File names follow the {color}_tripdata_{YYYY}-{MM}.parquet convention, with months zero-padded to two digits
  • Date ranges spanning multiple years must iterate month by month across year boundaries
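A minimal sketch of the URL-generation step. The function name is hypothetical; the host shown is the NYC TLC CloudFront distribution, and the file-naming pattern follows the TLC's published convention.

```python
def generate_urls(color, start_year, end_year, start_month, end_month):
    """Yield one Parquet download URL per month in the requested range."""
    base = "https://d37ci6vzurychx.cloudfront.net/trip-data"
    for year in range(start_year, end_year + 1):
        # Clamp the month window at the range boundaries
        first = start_month if year == start_year else 1
        last = end_month if year == end_year else 12
        for month in range(first, last + 1):
            # Months are zero-padded to two digits in the file names
            yield f"{base}/{color}_tripdata_{year}-{month:02d}.parquet"
```

For example, `generate_urls("yellow", 2021, 2021, 1, 3)` yields the January through March 2021 yellow-taxi files.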

Step 3: Loading Path Selection

Choose between two loading strategies based on user input. Path 1 (GCS-staged) downloads files to a GCS bucket first, then reads them using dlt's filesystem source. Path 2 (direct web) streams Parquet files from HTTP directly into memory and yields PyArrow tables.

Key considerations:

  • GCS path provides data lake retention of raw files
  • Direct web path is simpler but offers no intermediate storage
  • Both paths produce the same BigQuery result
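The path selection can be sketched as a simple dispatcher. Both loader functions below are hypothetical stand-ins that yield placeholder records; in the real workflow they would be the dlt resources described in Step 4.

```python
def choose_resource(use_gcs_staging, urls):
    """Dispatch to one of the two loading paths based on user input."""
    if use_gcs_staging:
        return gcs_staged_resource(urls)   # Path 1: stage raw files in GCS first
    return direct_web_resource(urls)       # Path 2: stream straight from HTTP


def gcs_staged_resource(urls):
    # Placeholder: would download to a bucket, then read via dlt's filesystem source
    for url in urls:
        yield {"source": url, "staged": True}


def direct_web_resource(urls):
    # Placeholder: would stream Parquet bytes from the web in 1MB chunks
    for url in urls:
        yield {"source": url, "staged": False}
```

Because both branches yield the same record shape, the downstream pipeline code in Step 5 is identical for either path.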

Step 4: Data Resource Definition

Define a dlt resource function that yields data records. For the GCS path, the resource uses dlt's filesystem source with read_parquet transformer to iterate over GCS Parquet files. For the direct web path, the resource streams HTTP responses in 1MB chunks, assembles them into PyArrow tables, and yields complete tables.

Key considerations:

  • write_disposition="replace" drops and recreates the target table on each run
  • PyArrow tables are yielded directly for efficient BigQuery loading
  • Error handling catches failed HTTP requests and continues with remaining URLs

Step 5: Pipeline Execution

Create a dlt pipeline configured with a pipeline name, dataset name, and BigQuery as the destination. Run the pipeline with the chosen resource function. dlt handles schema inference, table creation, and data loading automatically.

Key considerations:

  • Pipeline name enables dlt state tracking across runs
  • Dataset name maps to a BigQuery dataset
  • dlt automatically infers schema from Parquet column types
  • The run result (info) contains load statistics and any error details

Execution Diagram

GitHub URL

Workflow Repository