Implementation:DataTalksClub Data engineering zoomcamp Web To GCS Upload

Knowledge Sources	DataTalksClub_Data_engineering_zoomcamp
Domains	Data_Ingestion, GCS
Last Updated	2026-02-09 00:00 GMT

Overview

A Python script that downloads NYC taxi trip CSV data from GitHub releases, converts it to Parquet format using Pandas, and uploads the resulting files to Google Cloud Storage.

Description

This script provides an end-to-end pipeline for ingesting NYC taxi trip data into GCS. It defines two functions. upload_to_gcs uploads a local file to a specified GCS bucket using the Google Cloud Storage Python client. web_to_gcs orchestrates the full workflow: it iterates through all 12 months of a given year for a given taxi service type ('yellow', 'green', or 'fhv'). For each month, it downloads the compressed CSV file from the DataTalksClub GitHub releases, reads it into a Pandas DataFrame, writes it as Parquet using the PyArrow engine, and uploads the Parquet file to GCS under a service-specific prefix. The bucket name is read from the GCP_GCS_BUCKET environment variable and falls back to a default placeholder value when the variable is unset.

Usage

Use this implementation when you need to bulk-load historical NYC taxi trip data into Google Cloud Storage in Parquet format. It is suitable for initial data lake population or for backfilling a specific year and taxi service type. Prerequisites: install pandas, pyarrow, and google-cloud-storage; set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a service account key file; and set the GCP_GCS_BUCKET environment variable to the target bucket name.
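
The environment-based configuration can be checked before a long run. The sketch below is an assumption-laden illustration: resolve_bucket, missing_prerequisites, and the DEFAULT_BUCKET value are hypothetical names, and the script's real placeholder default is not reproduced here.

```python
import os

# Hypothetical placeholder; the script's actual default value is not shown on this page.
DEFAULT_BUCKET = "your-bucket-name"


def resolve_bucket(env=None) -> str:
    """Return the bucket from GCP_GCS_BUCKET, or the placeholder if unset."""
    env = os.environ if env is None else env
    return env.get("GCP_GCS_BUCKET", DEFAULT_BUCKET)


def missing_prerequisites(env=None) -> list:
    """List the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ("GOOGLE_APPLICATION_CREDENTIALS", "GCP_GCS_BUCKET")
    return [name for name in required if not env.get(name)]
```

Calling missing_prerequisites() with no arguments inspects the real process environment; passing a dict makes the check easy to exercise in isolation.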

Code Reference

Source Location

Repository: DataTalksClub_Data_engineering_zoomcamp
File: 03-data-warehouse/extras/web_to_gcs.py
Lines: 1-66

Signature

def upload_to_gcs(bucket, object_name, local_file):
    ...

def web_to_gcs(year, service):
    ...

Import

import io
import os
import requests
import pandas as pd
from google.cloud import storage
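
For orientation, an upload function with this signature might look roughly like the sketch below. It is not the repository's code: the name upload_to_gcs_sketch and the optional client parameter are hypothetical additions that allow the function to be exercised without real credentials.

```python
def upload_to_gcs_sketch(bucket, object_name, local_file, client=None):
    """Upload local_file to gs://<bucket>/<object_name> and return the blob.

    Hypothetical variant of upload_to_gcs; 'client' is an added, optional
    parameter so a stand-in client can be supplied in tests.
    """
    if client is None:
        # Imported lazily so the module loads even without the library installed
        from google.cloud import storage

        # Uses credentials referenced by GOOGLE_APPLICATION_CREDENTIALS
        client = storage.Client()

    # Bucket and blob handles are local references; no request is sent yet
    blob = client.bucket(bucket).blob(object_name)

    # This call performs the actual upload
    blob.upload_from_filename(local_file)
    return blob
```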

I/O Contract

Inputs

upload_to_gcs:

Name	Type	Required	Description
bucket	str	Yes	Name of the target GCS bucket
object_name	str	Yes	Destination object path within the GCS bucket
local_file	str	Yes	Path to the local file to upload

web_to_gcs:

Name	Type	Required	Description
year	str	Yes	The year of data to download (e.g., '2019', '2020')
service	str	Yes	The taxi service type (e.g., 'yellow', 'green', 'fhv')

Outputs

Name	Type	Description
GCS objects	Parquet files	12 monthly Parquet files uploaded to GCS under the path {service}/{service}_tripdata_{year}-{month}.parquet, where {month} is zero-padded (01 to 12)
Local files	.parquet files	Parquet files written to the local working directory as a side effect
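
The object path pattern in the table can be expanded into the full list of expected names, which is handy for verifying a finished run. The helper name expected_object_names is hypothetical.

```python
def expected_object_names(year: str, service: str) -> list:
    """Return the 12 GCS object names expected for one year and service."""
    names = []
    for month in range(1, 13):
        # {month:02d} zero-pads the month, e.g. 1 -> '01'
        names.append(f"{service}/{service}_tripdata_{year}-{month:02d}.parquet")
    return names
```

For example, expected_object_names('2019', 'green') starts with green/green_tripdata_2019-01.parquet and ends with green/green_tripdata_2019-12.parquet.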

Usage Examples

Basic Usage

# Download all 12 months of green taxi data for 2019 and upload to GCS
web_to_gcs('2019', 'green')

# Download all 12 months of yellow taxi data for 2020 and upload to GCS
web_to_gcs('2020', 'yellow')

Upload a Single File

# Upload a specific local Parquet file to GCS
upload_to_gcs(
    bucket="my-gcs-bucket",
    object_name="green/green_tripdata_2019-01.parquet",
    local_file="green_tripdata_2019-01.parquet"
)

Related Pages

Principle:DataTalksClub_Data_engineering_zoomcamp_GCS_Data_Upload
