
Environment:DataTalksClub Data engineering zoomcamp Dlt BigQuery Environment

From Leeroopedia


Knowledge Sources
Domains: Data_Ingestion, Cloud_Infrastructure
Last Updated: 2026-02-09 07:00 GMT

cohorts/2025/workshops/dynamic_load_dlt.py

Overview

A Python environment with dlt (data load tool), Google Cloud Storage, and BigQuery, used for flexible data ingestion pipelines that load NYC taxi Parquet data.

Description

This environment provides the dlt-based data ingestion runtime for loading NYC taxi trip data into BigQuery. It supports two loading paths: (1) downloading Parquet files to GCS and then loading them via the dlt filesystem source, or (2) streaming Parquet files directly from the web into BigQuery. The environment requires GCP service account credentials stored in a TOML configuration file, and uses PyArrow for Parquet parsing and the Google Cloud Storage client for bucket operations.

Usage

Use this environment for any dlt-based data ingestion workflow that loads data into BigQuery. It is the mandatory prerequisite for running the Toml_Credentials_Loader, Generate_Urls_Function, Dlt_Loading_Method_Selection, Dlt_Resource_Decorator, and Dlt_Pipeline_Run implementations.
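The URL-generation step referenced above (Generate_Urls_Function) can be sketched in plain Python. This is a hypothetical sketch, not the course's implementation: the CloudFront base URL and the function signature are assumptions, so substitute whatever CDN host and naming scheme your materials use.

```python
# Hypothetical URL generator for monthly NYC taxi Parquet files.
# BASE_URL is an assumption about the NYC TLC CloudFront CDN host.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def generate_urls(service: str, year: int, months: range) -> list[str]:
    """Build one Parquet URL per month, e.g. yellow_tripdata_2024-01.parquet."""
    return [
        f"{BASE_URL}/{service}_tripdata_{year}-{m:02d}.parquet"
        for m in months
    ]

urls = generate_urls("yellow", 2024, range(1, 4))  # Jan, Feb, Mar 2024
```

A list like `urls` can then be fed either to the GCS upload step (Method 1) or streamed directly (Method 2).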

System Requirements

  • OS: Linux, macOS, or Windows (no Docker required)
  • Python: 3.9+ (required by the dlt library)
  • Disk: ~1GB free (for temporary Parquet file downloads)
  • Network: internet access (downloads data from the NYC TLC CloudFront CDN and uploads to GCS/BigQuery)

Dependencies

Python Packages

  • `dlt` with BigQuery destination (`pip install "dlt[bigquery]"`)
  • `dlt` filesystem source (`dlt.sources.filesystem`)
  • `google-cloud-storage` (for GCS loading path)
  • `pyarrow` (for Parquet file parsing)
  • `toml` (for TOML credentials loading)
  • `requests` (for HTTP file downloads)

Credentials

The following credentials must be configured in `.dlt/secrets.toml`:

[credentials]
project_id = "your-gcp-project-id"
private_key = "your-service-account-private-key"
client_email = "your-service-account-email"

For the GCS loading path, an additional service account JSON file (`gcs.json`) is required:

  • `gcs.json`: GCP service account JSON file with Storage Admin and BigQuery Admin permissions

Warning: Never commit `.dlt/secrets.toml` or `gcs.json` to version control.

Quick Install

# Install dlt with BigQuery support
pip install "dlt[bigquery]" google-cloud-storage pyarrow toml requests

# Create credentials file
mkdir -p .dlt
# Edit .dlt/secrets.toml with your GCP credentials

Code Evidence

TOML credentials loading from `dynamic_load_dlt.py:17-22`:

import os    # imports appear earlier in the file
import toml

config = toml.load("./.dlt/secrets.toml")

# Export credentials as environment variables for dlt's config resolution
os.environ["CREDENTIALS__PROJECT_ID"] = config["credentials"]["project_id"]
os.environ["CREDENTIALS__PRIVATE_KEY"] = config["credentials"]["private_key"]
os.environ["CREDENTIALS__CLIENT_EMAIL"] = config["credentials"]["client_email"]

GCS client initialization from `dynamic_load_dlt.py:63`:

from google.cloud import storage  # import appears earlier in the file

storage_client = storage.Client.from_service_account_json("gcs.json")

dlt pipeline configuration from `dynamic_load_dlt.py:111-115`:

pipeline = dlt.pipeline(
    pipeline_name="test_taxi",
    dataset_name=input("Enter the dataset name: "),
    destination="bigquery"
)

Common Errors

  • `FileNotFoundError: .dlt/secrets.toml`: the credentials file is missing. Create `.dlt/secrets.toml` with your GCP credentials.
  • `google.auth.exceptions.DefaultCredentialsError`: invalid or missing GCP credentials. Verify that the service account JSON and the TOML credentials match.
  • `dlt.common.exceptions.DestinationTerminalException`: BigQuery dataset or permissions issue. Ensure the service account has the BigQuery Data Editor role.

Compatibility Notes

  • Two loading paths: Method 1 (GCS -> BigQuery) requires both a service account JSON file and TOML credentials. Method 2 (Web -> BigQuery) only requires TOML credentials but streams all data through the local machine.
  • Memory usage: Method 2 uses 1MB streaming chunks to keep memory usage low during downloads.
  • dlt version: The code uses `dlt.sources.filesystem` which requires dlt >= 0.3.0 with the filesystem source installed.
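The 1MB streaming behavior of Method 2 can be sketched with `requests`. The function name, timeout, and return value are assumptions for illustration; only the 1 MiB chunk size comes from the source:

```python
import requests

CHUNK_SIZE = 1024 * 1024  # 1 MiB chunks keep memory bounded while streaming

def stream_parquet(url: str, dest_path: str) -> int:
    """Download a Parquet file in 1 MiB chunks; return bytes written."""
    written = 0
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                f.write(chunk)
                written += len(chunk)
    return written
```

Because each chunk is written and released before the next is fetched, peak memory stays near 1MB regardless of file size.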
