Implementation:DataTalksClub Data engineering zoomcamp Pandas Dtype Configuration
| Metadata | |
|---|---|
| Knowledge Sources | DataTalksClub/data-engineering-zoomcamp |
| Domains | pandas, Data Types, CSV Parsing, NYC Taxi Data |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
A concrete configuration that defines a Python dictionary mapping 16 NYC yellow taxi column names to pandas dtypes, a list of datetime columns to parse, and a parameterized URL template for the monthly CSV.gz data files.
Description
This implementation provides the explicit schema configuration used by the ingest_data.py pipeline. It defines a dtype dictionary at module level that maps each of the 16 non-datetime columns in the NYC yellow taxi dataset to a pandas-compatible data type. Nullable integer types (Int64 with a capital I) are used for columns such as VendorID, passenger_count, and payment_type that may contain null values. Monetary amounts and distances use float64. The store_and_fwd_flag column uses the string type.
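The effect of the nullable Int64 type can be illustrated with a small sketch. The CSV text below is made-up sample data (not taken from the real dataset) and only a subset of the columns is used; it assumes pandas is installed.

```python
import io

import pandas as pd

# Made-up sample rows; the second row has no passenger_count.
csv_text = (
    "VendorID,passenger_count,store_and_fwd_flag,fare_amount\n"
    "1,2,N,12.5\n"
    "2,,Y,7.0\n"
)

# Subset of the dtype mapping described above.
dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "store_and_fwd_flag": "string",
    "fare_amount": "float64",
}

# Without dtype, the missing value forces the column to float64.
inferred = pd.read_csv(io.StringIO(csv_text))
# With dtype, the column stays integer and the gap becomes pd.NA.
explicit = pd.read_csv(io.StringIO(csv_text), dtype=dtype)

print(inferred["passenger_count"].dtype)  # float64
print(explicit["passenger_count"].dtype)  # Int64
print(explicit["passenger_count"].isna().sum())  # 1
```

With the lowercase NumPy type int64 instead of Int64, pandas cannot represent the missing value in an integer column, which is why the capital I matters here.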
A separate parse_dates list identifies the two datetime columns (tpep_pickup_datetime and tpep_dropoff_datetime) that should be parsed into pandas Timestamp objects.
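As a minimal sketch of what parse_dates changes, the example below reads two made-up rows (illustrative values, not real trip records) and computes trip durations directly from the parsed columns.

```python
import io

import pandas as pd

# Made-up sample rows with the two datetime columns named above.
csv_text = (
    "tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance\n"
    "2021-01-01 00:30:10,2021-01-01 00:36:12,2.1\n"
    "2021-01-01 00:51:20,2021-01-01 00:52:19,0.2\n"
)
parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

df = pd.read_csv(io.StringIO(csv_text), parse_dates=parse_dates)

# Both columns are datetime typed, so subtraction yields timedeltas.
duration = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]

print(df.dtypes)
print(duration.dt.total_seconds().tolist())  # [362.0, 59.0]
```

Without parse_dates, both columns would be read as plain text and the subtraction would fail.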
The URL is constructed using an f-string template that interpolates year and month CLI parameters into the GitHub release download path.
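One way to wrap this template is sketched below. The build_url function and the PREFIX constant are hypothetical names introduced for illustration, and the month range check is an added assumption; the pipeline itself builds the URL inline.

```python
PREFIX = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow"


def build_url(year: int, month: int) -> str:
    """Hypothetical helper: return the CSV.gz URL for one monthly partition."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # The :02d format spec zero-pads single-digit months (7 becomes "07").
    return f"{PREFIX}/yellow_tripdata_{year}-{month:02d}.csv.gz"


print(build_url(2021, 1))
```

Rejecting an out-of-range month early gives a clearer error than a failed download of a file that does not exist.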
Usage
These configuration objects are defined at the module level of ingest_data.py and passed directly to pd.read_csv(). The dtype and parse_dates parameters override pandas auto-inference, ensuring consistent column types across all monthly data partitions.
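The consistency claim can be checked with a short sketch. The two CSV strings stand in for monthly partitions and are made-up data; column_types is a hypothetical helper used only in this example.

```python
import io

import pandas as pd

dtype = {"VendorID": "Int64", "passenger_count": "Int64"}

# Two made-up "partitions": only the second has a missing value.
january = "VendorID,passenger_count\n1,1\n2,3\n"
february = "VendorID,passenger_count\n1,\n2,2\n"


def column_types(csv_text, **kwargs):
    """Hypothetical helper: read CSV text and return {column: dtype name}."""
    df = pd.read_csv(io.StringIO(csv_text), **kwargs)
    return {name: str(t) for name, t in df.dtypes.items()}


inferred = [column_types(p) for p in (january, february)]
explicit = [column_types(p, dtype=dtype) for p in (january, february)]

# Auto-inference picks int64 for January but float64 for February.
print(inferred[0] == inferred[1])  # False
# The explicit mapping yields Int64 for both partitions.
print(explicit[0] == explicit[1])  # True
```

Matching types across partitions matter when the partitions are later concatenated or appended to the same database table.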
Code Reference
Source Location: 01-docker-terraform/docker-sql/pipeline/ingest_data.py:L9-31
Signature:
dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}
parse_dates = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime"
]
URL Template (L46-47):
prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'
url = f'{prefix}/yellow_tripdata_{year}-{month:02d}.csv.gz'
Import: import pandas as pd
I/O Contract
Inputs
| Name | Type | Default | Description |
|---|---|---|---|
| year | int | 2021 | Year of the taxi data partition (CLI parameter via Click) |
| month | int | 1 | Month of the taxi data partition, zero-padded in the URL (CLI parameter via Click) |
Outputs
| Name | Type | Description |
|---|---|---|
| dtype | dict[str, str] | Dictionary mapping 16 column names to pandas dtype strings |
| parse_dates | list[str] | List of 2 column names to parse as datetime |
| url | str | Fully resolved URL to the CSV.gz file on GitHub Releases, e.g., https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz |
Usage Examples
Using the dtype and parse_dates with pd.read_csv:
import pandas as pd
dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}
parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
url = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz'
df = pd.read_csv(url, dtype=dtype, parse_dates=parse_dates)
print(df.dtypes)
Constructing the URL for different months:
year = 2021
month = 7
prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'
url = f'{prefix}/yellow_tripdata_{year}-{month:02d}.csv.gz'
# Result: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-07.csv.gz
Related Pages
- Principle:DataTalksClub_Data_engineering_zoomcamp_Data_Source_Configuration
- Implementation:DataTalksClub_Data_engineering_zoomcamp_Pandas_Chunked_CSV_Loading
- Implementation:DataTalksClub_Data_engineering_zoomcamp_SQLAlchemy_Create_Engine
- Environment:DataTalksClub_Data_engineering_zoomcamp_Docker_PostgreSQL_Python_Environment