Implementation:DataTalksClub Data engineering zoomcamp Web To GCS Upload


Knowledge Sources
Domains: Data_Ingestion, GCS
Last Updated: 2026-02-09 00:00 GMT

Overview

A Python script that downloads NYC taxi trip CSV data from GitHub releases, converts it to Parquet format using Pandas, and uploads the resulting files to Google Cloud Storage.

Description

This script provides an end-to-end pipeline for ingesting NYC taxi trip data into GCS. It defines two core functions. upload_to_gcs uploads a local file to a specified GCS bucket using the Google Cloud Storage Python client. web_to_gcs orchestrates the full workflow: it iterates over all 12 months of a given year for a given taxi service type (yellow, green, or FHV) and, for each month, downloads the compressed CSV file from the DataTalksClub GitHub releases, reads it into a Pandas DataFrame, writes it out as Parquet using the PyArrow engine, and uploads the Parquet file to GCS under a service-specific prefix. The target bucket name is read from the GCP_GCS_BUCKET environment variable and falls back to a default placeholder value when the variable is unset.
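
The per-month naming that web_to_gcs works through can be sketched as follows. This is a minimal illustration rather than the script's actual code: the helper name plan_month, the release URL layout in BASE_URL, the .csv.gz suffix, and the fallback bucket name are all assumptions. Only the Parquet object path pattern is taken from this page.

```python
import os

# Assumption: illustrative release URL layout; check the script for the real base URL.
BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"

# Bucket name comes from the environment, with a placeholder fallback.
# "your-bucket-name" is a hypothetical placeholder, not the script's default.
BUCKET = os.environ.get("GCP_GCS_BUCKET", "your-bucket-name")


def plan_month(year, service, month):
    """Return the names involved in processing one month (hypothetical helper)."""
    mm = f"{month:02d}"  # zero-pad the month: 1 -> "01"
    stem = f"{service}_tripdata_{year}-{mm}"
    return {
        "source_url": f"{BASE_URL}/{service}/{stem}.csv.gz",  # assumed suffix
        "local_parquet": f"{stem}.parquet",
        "gcs_object": f"{service}/{stem}.parquet",
    }


# One plan per month, January through December.
plans = [plan_month("2019", "green", m) for m in range(1, 13)]
print(plans[0]["gcs_object"])  # green/green_tripdata_2019-01.parquet
```

Keeping the name construction in one place makes it easy to confirm that the local file name and the GCS object name stay in step for every month.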

Usage

Use this implementation when you need to bulk-load historical NYC taxi trip data into Google Cloud Storage in Parquet format. It is suitable for initial data lake population or for backfilling a specific year and taxi service type. Prerequisites: install pandas, pyarrow, requests, and google-cloud-storage; set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a service-account key file; and set the GCP_GCS_BUCKET environment variable to the target bucket name.
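
Before starting a long backfill, the prerequisites above can be checked up front. The following is a hedged sketch, not part of the script: the helper names missing_env and missing_modules are hypothetical, and the module list is inferred from the packages named on this page.

```python
import importlib.util
import os

REQUIRED_ENV = ("GOOGLE_APPLICATION_CREDENTIALS", "GCP_GCS_BUCKET")
REQUIRED_MODULES = ("pandas", "pyarrow", "requests", "google.cloud.storage")


def missing_env(env=None):
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found on the import path."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package (e.g. "google.cloud") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_env():
        print(f"environment variable not set: {name}")
    for name in missing_modules():
        print(f"module not importable: {name}")
```

Because the bucket name falls back to a placeholder, an unset GCP_GCS_BUCKET would otherwise only surface as an upload failure after the first month has already been downloaded and converted.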

Code Reference

Source Location

Signature

def upload_to_gcs(bucket, object_name, local_file):
    ...

def web_to_gcs(year, service):
    ...
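
One plausible body for upload_to_gcs, sketched with the google-cloud-storage calls storage.Client, Client.bucket, Bucket.blob, and Blob.upload_from_filename. This is an assumption about how the function could be written, not a copy of the script. The optional client parameter is hypothetical and exists only so the sketch can be exercised without credentials.

```python
def upload_to_gcs(bucket, object_name, local_file, client=None):
    """Upload local_file to the given bucket under object_name.

    `client` is a hypothetical extra parameter (not in the documented
    signature) that lets a caller or test supply its own client object.
    """
    if client is None:
        # Imported lazily so this module loads without google-cloud-storage.
        from google.cloud import storage

        # Picks up credentials from GOOGLE_APPLICATION_CREDENTIALS.
        client = storage.Client()
    blob = client.bucket(bucket).blob(object_name)
    blob.upload_from_filename(local_file)
```

Called with the three documented arguments, the sketch behaves like the documented signature and creates its own client.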

Import

import io
import os
import requests
import pandas as pd
from google.cloud import storage

I/O Contract

Inputs

upload_to_gcs:

Name         Type  Required  Description
bucket       str   Yes       Name of the target GCS bucket
object_name  str   Yes       Destination object path within the GCS bucket
local_file   str   Yes       Path to the local file to upload

web_to_gcs:

Name     Type  Required  Description
year     str   Yes       The year of data to download (e.g., '2019', '2020')
service  str   Yes       The taxi service type (e.g., 'yellow', 'green', 'fhv')

Outputs

Name         Type            Description
GCS objects  Parquet files   12 monthly Parquet files uploaded to GCS under the path {service}/{service}_tripdata_{year}-{month}.parquet
Local files  .parquet files  Parquet files written to the local working directory as a side effect

Usage Examples

Basic Usage

# Download all 12 months of green taxi data for 2019 and upload to GCS
web_to_gcs('2019', 'green')

# Download all 12 months of yellow taxi data for 2020 and upload to GCS
web_to_gcs('2020', 'yellow')

Upload a Single File

# Upload a specific local Parquet file to GCS
upload_to_gcs(
    bucket="my-gcs-bucket",
    object_name="green/green_tripdata_2019-01.parquet",
    local_file="green_tripdata_2019-01.parquet"
)
