Environment:DataTalksClub Data engineering zoomcamp Dlt BigQuery Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Ingestion, Cloud_Infrastructure |
| Last Updated | 2026-02-09 07:00 GMT |
Source file: `cohorts/2025/workshops/dynamic_load_dlt.py`
Overview
Python environment with dlt (data load tool), Google Cloud Storage, and BigQuery for flexible data ingestion pipelines loading NYC taxi Parquet data.
Description
This environment provides the dlt-based data ingestion runtime for loading NYC taxi trip data into BigQuery. It supports two loading paths: (1) downloading Parquet files to GCS and then loading them via the dlt filesystem source, or (2) streaming Parquet files directly from the web into BigQuery. The environment requires GCP service account credentials stored in a TOML configuration file, and uses PyArrow for Parquet parsing and the google-cloud-storage client for bucket operations.
Usage
Use this environment for any dlt-based data ingestion workflow that loads data into BigQuery. It is the mandatory prerequisite for running the Toml_Credentials_Loader, Generate_Urls_Function, Dlt_Loading_Method_Selection, Dlt_Resource_Decorator, and Dlt_Pipeline_Run implementations.
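The Generate_Urls_Function implementation referenced above builds the list of Parquet download URLs. As a hedged sketch (the exact template lives in `dynamic_load_dlt.py`; this version assumes the public NYC TLC CloudFront pattern, and the function name and signature here are illustrative):

```python
def generate_urls(taxi_type: str, year: int, months: range) -> list[str]:
    """Build download URLs for NYC TLC trip-data Parquet files.

    Assumes the public CloudFront URL pattern; the template used by
    dynamic_load_dlt.py may differ.
    """
    base = "https://d37ci6vzurychx.cloudfront.net/trip-data"
    return [
        f"{base}/{taxi_type}_tripdata_{year}-{month:02d}.parquet"
        for month in months
    ]

# Example: yellow taxi data for January through March 2024
urls = generate_urls("yellow", 2024, range(1, 4))
```

Zero-padding the month (`{month:02d}`) matters: the CDN paths use `2024-01`, not `2024-1`.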
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows | No Docker required |
| Python | Python 3.9+ | Required by dlt library |
| Disk | ~1GB free | For temporary Parquet file downloads |
| Network | Internet access | Downloads data from NYC TLC CloudFront CDN and uploads to GCS/BigQuery |
Dependencies
Python Packages
- `dlt` with BigQuery destination (`pip install "dlt[bigquery]"`)
- `dlt` filesystem source (`dlt.sources.filesystem`)
- `google-cloud-storage` (for GCS loading path)
- `pyarrow` (for Parquet file parsing)
- `toml` (for TOML credentials loading)
- `requests` (for HTTP file downloads)
Credentials
The following credentials must be configured in `.dlt/secrets.toml`:

```toml
[credentials]
project_id = "your-gcp-project-id"
private_key = "your-service-account-private-key"
client_email = "your-service-account-email"
```
For the GCS loading path, an additional service account JSON file (`gcs.json`) is required:
- `gcs.json`: GCP service account JSON file with Storage Admin and BigQuery Admin permissions
Warning: Never commit `.dlt/secrets.toml` or `gcs.json` to version control.
Quick Install
```bash
# Install dlt with BigQuery support
pip install "dlt[bigquery]" google-cloud-storage pyarrow toml requests

# Create the credentials file
mkdir -p .dlt
touch .dlt/secrets.toml
# Edit .dlt/secrets.toml with your GCP credentials
```
Code Evidence
TOML credentials loading from `dynamic_load_dlt.py:17-22`:
```python
config = toml.load("./.dlt/secrets.toml")

# Set environment variables
os.environ["CREDENTIALS__PROJECT_ID"] = config["credentials"]["project_id"]
os.environ["CREDENTIALS__PRIVATE_KEY"] = config["credentials"]["private_key"]
os.environ["CREDENTIALS__CLIENT_EMAIL"] = config["credentials"]["client_email"]
```
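The double underscore in names like `CREDENTIALS__PROJECT_ID` is dlt's convention for nesting: each `__` marks a section boundary, so `[credentials] project_id` in TOML maps to `CREDENTIALS__PROJECT_ID` in the environment. A minimal sketch of that mapping (an illustrative helper, not part of `dynamic_load_dlt.py`):

```python
def to_dlt_env_vars(config: dict, prefix: str = "") -> dict:
    """Flatten a nested config dict into dlt-style env var names,
    joining nested sections with double underscores."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}".upper()
        if isinstance(value, dict):
            # Descend into the section, extending the prefix
            flat.update(to_dlt_env_vars(value, prefix=f"{name}__"))
        else:
            flat[name] = str(value)
    return flat

env = to_dlt_env_vars({"credentials": {"project_id": "my-project"}})
```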
GCS client initialization from `dynamic_load_dlt.py:63`:
```python
storage_client = storage.Client.from_service_account_json("gcs.json")
```
dlt pipeline configuration from `dynamic_load_dlt.py:111-115`:
```python
pipeline = dlt.pipeline(
    pipeline_name="test_taxi",
    dataset_name=input("Enter the dataset name: "),
    destination="bigquery",
)
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FileNotFoundError: .dlt/secrets.toml` | Credentials file missing | Create `.dlt/secrets.toml` with GCP credentials |
| `google.auth.exceptions.DefaultCredentialsError` | Invalid or missing GCP credentials | Verify service account JSON and TOML credentials match |
| `dlt.common.exceptions.DestinationTerminalException` | BigQuery dataset or permissions issue | Ensure service account has BigQuery Data Editor role |
Compatibility Notes
- Two loading paths: Method 1 (GCS -> BigQuery) requires both a service account JSON file and TOML credentials. Method 2 (Web -> BigQuery) only requires TOML credentials but streams all data through the local machine.
- Memory usage: Method 2 uses 1MB streaming chunks to keep memory usage low during downloads.
- dlt version: The code uses `dlt.sources.filesystem` which requires dlt >= 0.3.0 with the filesystem source installed.
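The 1 MB chunking mentioned above can be sketched as follows. The real script streams with `requests` (e.g. `response.iter_content(chunk_size)`); here an `io.BytesIO` stands in for the network response so the sketch is self-contained:

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MB, matching the streaming note above

def stream_chunks(source, chunk_size=CHUNK_SIZE):
    """Yield fixed-size chunks from a binary file-like object,
    keeping at most one chunk in memory at a time."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Simulate a 2.5 MB download: expect two full chunks and one partial.
fake_download = io.BytesIO(b"x" * (2 * CHUNK_SIZE + 512 * 1024))
sizes = [len(c) for c in stream_chunks(fake_download)]
```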
Related Pages
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Toml_Credentials_Loader
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Generate_Urls_Function
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Dlt_Loading_Method_Selection
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Dlt_Resource_Decorator
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Dlt_Pipeline_Run