Implementation:DataTalksClub Data engineering zoomcamp Generate Urls Function
| Page Metadata | |
|---|---|
| Knowledge Sources | repo: DataTalksClub/data-engineering-zoomcamp, dlt docs: dlt Documentation |
| Domains | Data_Engineering, Data_Ingestion |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for generating a list of NYC TLC trip data Parquet file URLs based on user-specified taxi color, year range, and month range parameters.
Description
The generate_urls function constructs a list of download URLs for NYC Taxi and Limousine Commission (TLC) trip record data hosted on the CloudFront CDN. It accepts a taxi color (e.g., "green" or "yellow"), a start and end year, and a start and end month. The function iterates over the Cartesian product of the year and month ranges, formatting each combination into a URL that follows the TLC data distribution naming convention.
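This page documents the function as an API reference. The function provides a parameterized interface for enumerating source files: it builds URL strings only and, since it relies on nothing beyond Python built-ins, it does not check that a file actually exists at each URL. The base URL points to the TLC CloudFront distribution at https://d37ci6vzurychx.cloudfront.net/trip-data/, and each file follows the naming pattern {color}_tripdata_{year}-{month:02d}.parquet.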
The function uses zero-padded month formatting (f"{month:02d}") to match the upstream naming convention where months are always represented as two digits (e.g., 01 through 12).
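The behavior described above can be sketched as follows. This is a minimal reconstruction from the description, not a copy of the source script: the constant name `BASE_URL`, the type hints, and the loop layout are assumptions.

```python
from typing import List

# Base URL as given in the description; the constant name is an assumption.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/"


def generate_urls(color: str, start_year: int, end_year: int,
                  start_month: int, end_month: int) -> List[str]:
    """Sketch: build one URL per (year, month) pair, both ranges inclusive."""
    urls = []
    for year in range(start_year, end_year + 1):
        for month in range(start_month, end_month + 1):
            # {month:02d} zero-pads the month to two digits (1 -> "01").
            urls.append(f"{BASE_URL}{color}_tripdata_{year}-{month:02d}.parquet")
    return urls
```

Iterating years in the outer loop and months in the inner loop yields the chronological ordering shown in the usage examples.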
Usage
Use this function when:
- Preparing to ingest NYC TLC trip data for a specific date range and taxi type
- Building a list of URLs to pass to a dlt resource for batch downloading
- Letting the user specify the ingestion scope at runtime via interactive prompts
Code Reference
Source Location: cohorts/2025/workshops/dynamic_load_dlt.py, lines 25-39
Signature:
def generate_urls(color, start_year, end_year, start_month, end_month):
# Returns List[str] of Parquet file URLs
Import:
# No additional imports required; uses only Python built-ins
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| color | str | Yes | Taxi trip color type (e.g., "green", "yellow") |
| start_year | int | Yes | First year in the range (inclusive), e.g., 2019 |
| end_year | int | Yes | Last year in the range (inclusive), e.g., 2022 |
| start_month | int | Yes | First month in the range (inclusive), 1-12 |
| end_month | int | Yes | Last month in the range (inclusive), 1-12 |
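The contract above can be checked before any URLs are built. The sketch below is illustrative only: `validate_url_params` and `VALID_COLORS` are hypothetical names, and nothing on this page confirms that the source script performs such checks.

```python
# Assumption: only the two colors named on this page are accepted.
VALID_COLORS = {"green", "yellow"}


def validate_url_params(color, start_year, end_year, start_month, end_month):
    """Hypothetical helper: raise ValueError if arguments break the contract."""
    if color not in VALID_COLORS:
        raise ValueError(f"unsupported color: {color!r}")
    if start_year > end_year:
        raise ValueError("start_year must not exceed end_year")
    for name, month in (("start_month", start_month), ("end_month", end_month)):
        if not 1 <= month <= 12:
            raise ValueError(f"{name} must be between 1 and 12, got {month}")
    if start_month > end_month:
        raise ValueError("start_month must not exceed end_month")
```

Rejecting reversed ranges explicitly is a design choice: with inclusive ranges built from `range(start, end + 1)`, a reversed range would otherwise silently produce an empty list.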
Outputs:
| Output | Type | Description |
|---|---|---|
| urls | List[str] | Ordered list of fully qualified Parquet file URLs following the pattern https://d37ci6vzurychx.cloudfront.net/trip-data/{color}_tripdata_{year}-{month:02d}.parquet |
Usage Examples
Generating URLs for a single year and all 12 months:
urls = generate_urls("green", 2021, 2021, 1, 12)
# Returns 12 URLs:
# ['https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-01.parquet',
# 'https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-02.parquet',
# ...
# 'https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-12.parquet']
Generating URLs for multiple years with a restricted month range:
urls = generate_urls("yellow", 2020, 2022, 1, 3)
# Returns 9 URLs (3 years x 3 months):
# ['https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-01.parquet',
# 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-02.parquet',
# 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-03.parquet',
# 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet',
# ...
# 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet']
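The counts in the two examples above follow from multiplying the number of years by the number of months, because the same month range is applied within every year. A small sketch of that arithmetic, using a hypothetical helper `expected_url_count` that is not part of the source script:

```python
def expected_url_count(start_year, end_year, start_month, end_month):
    """Hypothetical helper: size of the inclusive year x month product."""
    years = max(0, end_year - start_year + 1)
    months = max(0, end_month - start_month + 1)
    return years * months


print(expected_url_count(2021, 2021, 1, 12))  # 12
print(expected_url_count(2020, 2022, 1, 3))   # 9
```

One consequence of the product form is that a span crossing a year boundary, such as November 2020 through February 2021, cannot be expressed in a single call; it needs one call per year with different month ranges.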
Using with interactive user input (as in the source script):
color = input("Enter color (green, yellow): ").lower()
start_year = int(input("Enter the start year (e.g., 2019): "))
end_year = int(input("Enter the end year (e.g., 2022): "))
start_month = int(input("Enter the start month (1-12): "))
end_month = int(input("Enter the end month (1-12): "))
urls = generate_urls(color, start_year, end_year, start_month, end_month)
print(f"Generated {len(urls)} URLs")
for url in urls:
    print(url)