Implementation:DataTalksClub Data engineering zoomcamp Web To GCS Upload
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, GCS |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A Python script that downloads NYC taxi trip CSV data from GitHub releases, converts it to Parquet format using Pandas, and uploads the resulting files to Google Cloud Storage.
Description
This script provides an end-to-end pipeline for ingesting NYC taxi trip data into GCS. It defines two functions. upload_to_gcs uploads a local file to a specified GCS bucket using the Google Cloud Storage Python client. web_to_gcs orchestrates the full workflow: it iterates over all 12 months of a given year for a given taxi service type (yellow, green, or fhv). For each month, it downloads the compressed CSV file from the DataTalksClub GitHub releases, reads it into a Pandas DataFrame, writes the DataFrame to a local Parquet file using the PyArrow engine, and uploads that file to GCS under a service-specific prefix. The GCS bucket name is read from the GCP_GCS_BUCKET environment variable and falls back to a default placeholder value when the variable is unset.
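The per-month workflow described above can be sketched with the three steps passed in as callables, which keeps the loop testable without network or GCS access. This is a minimal sketch under stated assumptions, not the script's actual code: `run_monthly_pipeline` and its `download`, `convert`, and `upload` parameters are hypothetical names, and the `.csv.gz` source file name is assumed by analogy with the Parquet naming given in the Outputs table.

```python
from typing import Callable, List


def run_monthly_pipeline(
    year: str,
    service: str,
    download: Callable[[str], None],
    convert: Callable[[str, str], None],
    upload: Callable[[str, str], None],
) -> List[str]:
    """Run download -> convert -> upload for each of the 12 months.

    Returns the GCS object names that were uploaded, in month order.
    """
    uploaded = []
    for month_number in range(1, 13):
        month = f"{month_number:02d}"  # zero-padded: "01" .. "12"

        # Assumed source file name; the real release asset name may differ.
        csv_name = f"{service}_tripdata_{year}-{month}.csv.gz"
        parquet_name = f"{service}_tripdata_{year}-{month}.parquet"

        download(csv_name)               # fetch the compressed CSV
        convert(csv_name, parquet_name)  # CSV -> Parquet on local disk

        # Objects are grouped under a service-specific prefix.
        object_name = f"{service}/{parquet_name}"
        upload(object_name, parquet_name)
        uploaded.append(object_name)
    return uploaded


if __name__ == "__main__":
    # Dry run with stand-in steps that only record what would happen.
    log = []
    names = run_monthly_pipeline(
        "2019",
        "green",
        download=lambda src: log.append(("download", src)),
        convert=lambda src, dst: log.append(("convert", src, dst)),
        upload=lambda obj, path: log.append(("upload", obj, path)),
    )
    print(names[0])
```

Injecting the steps is a design choice for this illustration only; it lets the month loop and naming logic be checked in isolation before real download, conversion, and upload functions are plugged in.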
Usage
Use this implementation when you need to bulk-load historical NYC taxi trip data into Google Cloud Storage in Parquet format. It is suitable for initial data lake population or for backfilling data for a specific year and taxi service type. Prerequisites: install pandas, pyarrow, and google-cloud-storage (the script also imports requests); set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a service account key file; and set the GCP_GCS_BUCKET environment variable to the target bucket name, since the fallback is only a placeholder.
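A small pre-flight check can make the environment prerequisites explicit before a long backfill starts. The following is a sketch only: `resolve_config` and `DEFAULT_BUCKET` are hypothetical names, and the placeholder string is an assumption, because the script's actual default bucket value is not stated here.

```python
import os
from typing import Mapping, Tuple

# Hypothetical placeholder; the script's actual default value is not shown here.
DEFAULT_BUCKET = "your-bucket-name"


def resolve_config(env: Mapping[str, str] = os.environ) -> Tuple[str, bool]:
    """Return (bucket_name, credentials_configured) from environment variables."""
    bucket = env.get("GCP_GCS_BUCKET", DEFAULT_BUCKET)
    credentials_configured = bool(env.get("GOOGLE_APPLICATION_CREDENTIALS"))
    return bucket, credentials_configured


if __name__ == "__main__":
    bucket, has_credentials = resolve_config()
    if bucket == DEFAULT_BUCKET:
        print("Warning: GCP_GCS_BUCKET is not set; using the placeholder bucket name.")
    if not has_credentials:
        print("Warning: GOOGLE_APPLICATION_CREDENTIALS is not set.")
    print(f"Target bucket: {bucket}")
```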
Code Reference
Source Location
- Repository: DataTalksClub_Data_engineering_zoomcamp
- File: 03-data-warehouse/extras/web_to_gcs.py
- Lines: 1-66
Signature
def upload_to_gcs(bucket, object_name, local_file):
    ...
def web_to_gcs(year, service):
    ...
Import
import io
import os
import requests
import pandas as pd
from google.cloud import storage
I/O Contract
Inputs
upload_to_gcs:
| Name | Type | Required | Description |
|---|---|---|---|
| bucket | str | Yes | Name of the target GCS bucket |
| object_name | str | Yes | Destination object path within the GCS bucket |
| local_file | str | Yes | Path to the local file to upload |
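Given these three inputs, an upload with the Google Cloud Storage Python client might look like the sketch below. It is an illustration rather than the script's verbatim body: the optional `client` parameter is an addition made here so that a stand-in client can be injected for testing, and it is not part of the documented signature.

```python
def upload_to_gcs(bucket, object_name, local_file, client=None):
    """Upload local_file to the object object_name in the named bucket.

    The `client` parameter is an assumption of this sketch (not in the
    documented signature); when omitted, a real storage client is created.
    """
    if client is None:
        # Imported lazily so the function can be exercised with a fake
        # client even where google-cloud-storage is not installed.
        from google.cloud import storage

        # Credentials are resolved via Application Default Credentials,
        # which honours GOOGLE_APPLICATION_CREDENTIALS.
        client = storage.Client()

    blob = client.bucket(bucket).blob(object_name)
    blob.upload_from_filename(local_file)
```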
web_to_gcs:
| Name | Type | Required | Description |
|---|---|---|---|
| year | str | Yes | The year of data to download (e.g., '2019', '2020') |
| service | str | Yes | The taxi service type (e.g., 'yellow', 'green', 'fhv') |
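Because both arguments are interpolated into file and object names, a caller may want to validate them first. Nothing in the contract above indicates that the script performs such a check, so the helper below is purely a hypothetical addition; `validate_args` and `KNOWN_SERVICES` are illustrative names, and the four-digit rule is an assumption based on the example values.

```python
import re

# Service types named in this document.
KNOWN_SERVICES = ("yellow", "green", "fhv")


def validate_args(year: str, service: str) -> None:
    """Raise ValueError if the arguments would produce malformed file names."""
    if not isinstance(year, str) or not re.fullmatch(r"\d{4}", year):
        raise ValueError(f"year must be a four-digit string, got {year!r}")
    if service not in KNOWN_SERVICES:
        raise ValueError(f"service must be one of {KNOWN_SERVICES}, got {service!r}")
```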
Outputs
| Name | Type | Description |
|---|---|---|
| GCS objects | Parquet files | 12 monthly Parquet files uploaded to GCS under the path {service}/{service}_tripdata_{year}-{month}.parquet, where {month} is zero-padded (01 to 12) |
| Local files | .parquet files | Parquet files written to the local working directory as a side effect |
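The object naming pattern in the table makes it straightforward to check whether a backfill completed. The sketch below derives the 12 expected object names and reports which are absent from a given listing; `expected_object_names` and `missing_objects` are hypothetical helpers, not functions defined by the script.

```python
from typing import Iterable, List


def expected_object_names(year: str, service: str) -> List[str]:
    """Build the 12 object names for one year and service, months 01..12."""
    return [
        f"{service}/{service}_tripdata_{year}-{month:02d}.parquet"
        for month in range(1, 13)
    ]


def missing_objects(year: str, service: str, existing: Iterable[str]) -> List[str]:
    """Return expected object names that do not appear in `existing`."""
    present = set(existing)
    return [name for name in expected_object_names(year, service) if name not in present]
```

The `existing` names could, for example, be collected from the `name` attribute of the blobs returned by `storage.Client().list_blobs(bucket_name, prefix=f"{service}/")`, assuming valid credentials and an existing bucket.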
Usage Examples
Basic Usage
# Download all 12 months of green taxi data for 2019 and upload to GCS
web_to_gcs('2019', 'green')
# Download all 12 months of yellow taxi data for 2020 and upload to GCS
web_to_gcs('2020', 'yellow')
Upload a Single File
# Upload a specific local Parquet file to GCS
upload_to_gcs(
    bucket="my-gcs-bucket",
    object_name="green/green_tripdata_2019-01.parquet",
    local_file="green_tripdata_2019-01.parquet",
)