
Environment:DataTalksClub Data engineering zoomcamp Dlt BigQuery Environment

From Leeroopedia


Knowledge Sources
Domains: Data_Ingestion, Cloud_Infrastructure
Last Updated: 2026-02-09 07:00 GMT

cohorts/2025/workshops/dynamic_load_dlt.py

Overview

A Python environment with dlt (data load tool), Google Cloud Storage, and BigQuery, used for flexible data ingestion pipelines that load NYC taxi Parquet data.

Description

This environment provides the dlt-based data ingestion runtime for loading NYC taxi trip data into BigQuery. It supports two loading paths: (1) downloading Parquet files to GCS and then loading them via the dlt filesystem source, or (2) streaming Parquet files directly from the web into BigQuery. The environment requires GCP service account credentials stored in a TOML configuration file, and uses PyArrow for Parquet parsing and the Google Cloud Storage client for bucket operations.

Usage

Use this environment for any dlt-based data ingestion workflow that loads data into BigQuery. It is the mandatory prerequisite for running the Toml_Credentials_Loader, Generate_Urls_Function, Dlt_Loading_Method_Selection, Dlt_Resource_Decorator, and Dlt_Pipeline_Run implementations.
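The URL-generation step referenced above (Generate_Urls_Function) can be sketched in plain Python. This is a hypothetical sketch, not the course's implementation: the CloudFront base URL and the function signature are assumptions, so substitute whatever CDN host and naming scheme your materials use.

```python
# Hypothetical URL generator for monthly NYC taxi Parquet files.
# BASE_URL is an assumption about the NYC TLC CloudFront CDN host.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def generate_urls(service: str, year: int, months: range) -> list[str]:
    """Build one Parquet URL per month, e.g. yellow_tripdata_2024-01.parquet."""
    return [
        f"{BASE_URL}/{service}_tripdata_{year}-{m:02d}.parquet"
        for m in months
    ]

urls = generate_urls("yellow", 2024, range(1, 4))  # Jan, Feb, Mar 2024
```

A list like `urls` can then be fed either to the GCS upload step (Method 1) or streamed directly (Method 2).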

System Requirements

  • OS: Linux, macOS, or Windows (no Docker required)
  • Python: 3.9+ (required by the dlt library)
  • Disk: ~1GB free (for temporary Parquet file downloads)
  • Network: internet access (downloads data from the NYC TLC CloudFront CDN and uploads to GCS/BigQuery)

Dependencies

Python Packages

  • `dlt` with BigQuery destination (`pip install "dlt[bigquery]"`)
  • `dlt` filesystem source (`dlt.sources.filesystem`)
  • `google-cloud-storage` (for GCS loading path)
  • `pyarrow` (for Parquet file parsing)
  • `toml` (for TOML credentials loading)
  • `requests` (for HTTP file downloads)

Credentials

The following credentials must be configured in `.dlt/secrets.toml`:

[credentials]
project_id = "your-gcp-project-id"
private_key = "your-service-account-private-key"
client_email = "your-service-account-email"

For the GCS loading path, an additional service account JSON file (`gcs.json`) is required:

  • `gcs.json`: GCP service account JSON file with Storage Admin and BigQuery Admin permissions

Warning: Never commit `.dlt/secrets.toml` or `gcs.json` to version control.

Quick Install

# Install dlt with BigQuery support
pip install "dlt[bigquery]" google-cloud-storage pyarrow toml requests

# Create credentials file
mkdir -p .dlt
# Edit .dlt/secrets.toml with your GCP credentials

Code Evidence

TOML credentials loading from `dynamic_load_dlt.py:17-22`:

import os    # imports appear earlier in the file
import toml

config = toml.load("./.dlt/secrets.toml")

# Export credentials as environment variables for dlt's config resolution
os.environ["CREDENTIALS__PROJECT_ID"] = config["credentials"]["project_id"]
os.environ["CREDENTIALS__PRIVATE_KEY"] = config["credentials"]["private_key"]
os.environ["CREDENTIALS__CLIENT_EMAIL"] = config["credentials"]["client_email"]

GCS client initialization from `dynamic_load_dlt.py:63`:

from google.cloud import storage  # import appears earlier in the file

storage_client = storage.Client.from_service_account_json("gcs.json")

dlt pipeline configuration from `dynamic_load_dlt.py:111-115`:

pipeline = dlt.pipeline(
    pipeline_name="test_taxi",
    dataset_name=input("Enter the dataset name: "),
    destination="bigquery"
)

Common Errors

  • `FileNotFoundError: .dlt/secrets.toml`: the credentials file is missing. Create `.dlt/secrets.toml` with your GCP credentials.
  • `google.auth.exceptions.DefaultCredentialsError`: invalid or missing GCP credentials. Verify that the service account JSON and the TOML credentials match.
  • `dlt.common.exceptions.DestinationTerminalException`: BigQuery dataset or permissions issue. Ensure the service account has the BigQuery Data Editor role.

Compatibility Notes

  • Two loading paths: Method 1 (GCS -> BigQuery) requires both a service account JSON file and TOML credentials. Method 2 (Web -> BigQuery) only requires TOML credentials but streams all data through the local machine.
  • Memory usage: Method 2 uses 1MB streaming chunks to keep memory usage low during downloads.
  • dlt version: The code uses `dlt.sources.filesystem` which requires dlt >= 0.3.0 with the filesystem source installed.
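The 1MB streaming behavior of Method 2 can be sketched with `requests`. The function name, timeout, and return value are assumptions for illustration; only the 1 MiB chunk size comes from the source:

```python
import requests

CHUNK_SIZE = 1024 * 1024  # 1 MiB chunks keep memory bounded while streaming

def stream_parquet(url: str, dest_path: str) -> int:
    """Download a Parquet file in 1 MiB chunks; return bytes written."""
    written = 0
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                f.write(chunk)
                written += len(chunk)
    return written
```

Because each chunk is written and released before the next is fetched, peak memory stays near 1MB regardless of file size.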
