Implementation:DataTalksClub Data engineering zoomcamp Generate Urls Function

From Leeroopedia


Page Metadata
Knowledge Sources: repo: DataTalksClub/data-engineering-zoomcamp; dlt docs: dlt Documentation
Domains: Data_Engineering, Data_Ingestion
Last Updated: 2026-02-09 14:00 GMT

Overview

Concrete tool for generating a list of NYC TLC trip data Parquet file URLs based on user-specified taxi color, year range, and month range parameters.

Description

The generate_urls function constructs a list of download URLs for NYC Taxi and Limousine Commission (TLC) trip record data hosted on a CloudFront CDN. It accepts a taxi color (e.g., "green" or "yellow"), a start and end year, and a start and end month. The function iterates over the Cartesian product of the year and month ranges, so the same month range is applied to every year in the year range, and formats each (year, month) combination into a URL that follows the TLC data distribution naming convention.

This is an API Doc implementation. The function provides a clean, parameterized interface for data source discovery. The base URL points to the TLC CloudFront distribution at https://d37ci6vzurychx.cloudfront.net/trip-data/, and each file follows the naming pattern {color}_tripdata_{year}-{month:02d}.parquet.

The function uses zero-padded month formatting (f"{month:02d}") to match the upstream naming convention where months are always represented as two digits (e.g., 01 through 12).
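The body of the function is not reproduced on this page, so the following is only a minimal sketch of the behavior described above, assuming two nested inclusive loops. The name generate_urls_sketch and the BASE_URL constant are illustrative and not taken from the source script; the base URL and file pattern are the ones stated in this section.

```python
from typing import List

# Base URL as stated in the description above.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/"


def generate_urls_sketch(color: str, start_year: int, end_year: int,
                         start_month: int, end_month: int) -> List[str]:
    """Illustrative re-creation of the described behavior (hypothetical name)."""
    urls = []
    for year in range(start_year, end_year + 1):          # inclusive year range
        for month in range(start_month, end_month + 1):   # inclusive month range
            # {month:02d} zero-pads the month to two digits, e.g. 1 -> "01".
            urls.append(f"{BASE_URL}{color}_tripdata_{year}-{month:02d}.parquet")
    return urls


for url in generate_urls_sketch("green", 2021, 2021, 1, 2):
    print(url)
```

Under this assumption the result is ordered by year first and then by month, and its length is the number of years multiplied by the number of months.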

Usage

Use this function when:

  • Preparing to ingest NYC TLC trip data for a specific date range and taxi type
  • Building a list of URLs to pass to a dlt resource for batch downloading
  • Letting the user specify the ingestion scope at runtime via interactive prompts

Code Reference

Source Location: cohorts/2025/workshops/dynamic_load_dlt.py, lines 25-39

Signature:

def generate_urls(color, start_year, end_year, start_month, end_month):
    # Returns List[str] of Parquet file URLs
    ...

Import:

# No additional imports required; uses only Python built-ins

I/O Contract

Inputs:

Parameter | Type | Required | Description
color | str | Yes | Taxi trip color type (e.g., "green", "yellow")
start_year | int | Yes | First year in the range (inclusive), e.g., 2019
end_year | int | Yes | Last year in the range (inclusive), e.g., 2022
start_month | int | Yes | First month in the range (inclusive), 1-12
end_month | int | Yes | Last month in the range (inclusive), 1-12

Outputs:

Output | Type | Description
urls | List[str] | Ordered list of fully qualified Parquet file URLs following the pattern https://d37ci6vzurychx.cloudfront.net/trip-data/{color}_tripdata_{year}-{month:02d}.parquet

Usage Examples

Generating URLs for a single year and all 12 months:

urls = generate_urls("green", 2021, 2021, 1, 12)
# Returns 12 URLs:
# ['https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-01.parquet',
#  'https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-02.parquet',
#  ...
#  'https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-12.parquet']

Generating URLs for multiple years with a restricted month range:

urls = generate_urls("yellow", 2020, 2022, 1, 3)
# Returns 9 URLs (3 years x 3 months):
# ['https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-01.parquet',
#  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-02.parquet',
#  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-03.parquet',
#  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet',
#  ...
#  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet']
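Because the month range is applied to each year separately, a contiguous span that crosses a year boundary (for example, November 2020 through February 2021) does not map onto a single year-range and month-range pair. The sketch below shows one possible way to build such a span. month_span_urls is a hypothetical helper, not part of the source script, and it reuses only the URL pattern stated in this page.

```python
from typing import List, Tuple

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/"


def month_span_urls(color: str, start: Tuple[int, int],
                    end: Tuple[int, int]) -> List[str]:
    """Hypothetical helper: one URL per month from start to end, inclusive.

    start and end are (year, month) tuples with month in 1-12.
    """
    # Convert each (year, month) pair to a running month index.
    first = start[0] * 12 + (start[1] - 1)
    last = end[0] * 12 + (end[1] - 1)
    urls = []
    for index in range(first, last + 1):
        year, month_zero_based = divmod(index, 12)
        month = month_zero_based + 1
        urls.append(f"{BASE_URL}{color}_tripdata_{year}-{month:02d}.parquet")
    return urls


span = month_span_urls("yellow", (2020, 11), (2021, 2))
print(len(span))  # 4 months: 2020-11, 2020-12, 2021-01, 2021-02
```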

Using with interactive user input (as in the source script):

color = input("Enter color (green, yellow): ").lower()
start_year = int(input("Enter the start year (e.g., 2019): "))
end_year = int(input("Enter the end year (e.g., 2022): "))
start_month = int(input("Enter the start month (1-12): "))
end_month = int(input("Enter the end month (1-12): "))

urls = generate_urls(color, start_year, end_year, start_month, end_month)
print(f"Generated {len(urls)} URLs")
for url in urls:
    print(url)
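The prompts above pass raw input straight to int(), so a non-numeric entry raises ValueError and an out-of-range month is not rejected. As a sketch of how the values could be checked before URLs are built, the helpers below validate each field. parse_int_in_range, check_color, and VALID_COLORS are hypothetical names introduced here, and restricting colors to "green" and "yellow" is an assumption based on the prompt text.

```python
# Assumption: only the two colors named in the prompt are accepted.
VALID_COLORS = {"green", "yellow"}


def parse_int_in_range(text: str, low: int, high: int, label: str) -> int:
    """Convert text to an int and require low <= value <= high."""
    try:
        value = int(text.strip())
    except ValueError:
        raise ValueError(f"{label} must be an integer, got {text!r}") from None
    if not low <= value <= high:
        raise ValueError(f"{label} must be between {low} and {high}, got {value}")
    return value


def check_color(text: str) -> str:
    """Normalize the color to lowercase and require a known value."""
    color = text.strip().lower()
    if color not in VALID_COLORS:
        raise ValueError(f"color must be one of {sorted(VALID_COLORS)}, got {text!r}")
    return color


# Example with literal strings standing in for input() results.
color = check_color(" Green ")
start_month = parse_int_in_range("3", 1, 12, "start month")
print(color, start_month)
```

In an interactive script, each input(...) result would be passed through these helpers before the call to generate_urls.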
