Implementation:DataTalksClub Data engineering zoomcamp Pandas Dtype Configuration

From Leeroopedia


Metadata
Knowledge Sources: DataTalksClub/data-engineering-zoomcamp
Domains: pandas, Data Types, CSV Parsing, NYC Taxi Data
Last Updated: 2026-02-09 14:00 GMT

Overview

A concrete schema configuration consisting of a Python dictionary that maps 16 NYC yellow taxi column names to pandas dtypes, a list of datetime columns to parse, and a parameterized URL template for the monthly CSV.gz data files.

Description

This implementation provides the explicit schema configuration used by the ingest_data.py pipeline. It defines a dtype dictionary at module level that maps each of the 16 non-datetime columns in the NYC yellow taxi dataset to a pandas dtype string. Identifier and count columns that may contain null values, such as VendorID, passenger_count, and payment_type, use the nullable integer type Int64 (capital I), which can hold missing values without falling back to floating point. Monetary amounts and distances use float64. The store_and_fwd_flag column uses the pandas string dtype.
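The difference between the nullable Int64 dtype and default inference can be seen on a tiny series. This is a minimal sketch with made-up values, not data from the taxi files:

```python
import pandas as pd

# Illustrative values only; the None stands in for a missing passenger_count.
values = [1, 2, None]

# Default inference: the missing value becomes NaN, which forces float64.
as_float = pd.Series(values)

# Nullable integer dtype: the missing value is stored as pd.NA and the
# remaining values stay integers.
as_nullable = pd.Series(values, dtype="Int64")

print(as_float.dtype)     # float64
print(as_nullable.dtype)  # Int64
```

With Int64 the column keeps integer semantics, so identifiers such as VendorID are not turned into values like 1.0 when a row is missing.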

A separate parse_dates list identifies the two datetime columns (tpep_pickup_datetime and tpep_dropoff_datetime) that should be parsed into pandas Timestamp objects.
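As a rough illustration of what parse_dates does, the sketch below reads a one-row in-memory CSV. The row is invented for the example and the file is reduced to three columns; only the two datetime column names come from the configuration.

```python
import io
import pandas as pd

# Invented sample row; real files contain many more columns.
csv_text = (
    "tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance\n"
    "2021-01-01 00:30:10,2021-01-01 00:36:12,2.1\n"
)

parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

df = pd.read_csv(io.StringIO(csv_text), parse_dates=parse_dates)

# Once parsed, the columns support datetime arithmetic directly.
duration = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
print(duration.iloc[0])
```

Without parse_dates, both columns would be read as plain strings and the subtraction would fail.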

The URL is constructed using an f-string template that interpolates year and month CLI parameters into the GitHub release download path.

Usage

These configuration objects are defined at the module level of ingest_data.py and passed directly to pd.read_csv(). The dtype and parse_dates parameters override pandas auto-inference, ensuring consistent column types across all monthly data partitions.
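The consistency point can be sketched with two small in-memory "partitions". The CSV contents and the column_dtypes helper are hypothetical and exist only for this illustration; the second partition has a missing passenger_count.

```python
import io
import pandas as pd

# Hypothetical partitions: "jan" is complete, "feb" has one missing value.
jan = "VendorID,passenger_count\n1,2\n2,1\n"
feb = "VendorID,passenger_count\n1,2\n2,\n"

dtype = {"VendorID": "Int64", "passenger_count": "Int64"}


def column_dtypes(csv_text, dtype=None):
    """Hypothetical helper: read a CSV string and return {column: dtype name}."""
    df = pd.read_csv(io.StringIO(csv_text), dtype=dtype)
    return {name: str(dt) for name, dt in df.dtypes.items()}


# With auto-inference the schema depends on the data in each partition.
inferred_jan = column_dtypes(jan)
inferred_feb = column_dtypes(feb)

# With an explicit dtype mapping both partitions get the same schema.
explicit_jan = column_dtypes(jan, dtype)
explicit_feb = column_dtypes(feb, dtype)

print(inferred_jan, inferred_feb)
print(explicit_jan, explicit_feb)
```

Here inference yields int64 for the complete partition and float64 for the one with a gap, while the explicit mapping gives Int64 in both cases, which matters when partitions are appended to the same database table.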

Code Reference

Source Location: 01-docker-terraform/docker-sql/pipeline/ingest_data.py:L9-31

Signature:

dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}

parse_dates = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime"
]

URL Template (L46-47):

prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'
url = f'{prefix}/yellow_tripdata_{year}-{month:02d}.csv.gz'

Import: import pandas as pd

I/O Contract

Inputs

Name | Type | Default | Description
year | int | 2021 | Year of the taxi data partition (CLI parameter via Click)
month | int | 1 | Month of the taxi data partition, zero-padded in URL (CLI parameter via Click)

Outputs

Name | Type | Description
dtype | dict[str, str] | Dictionary mapping 16 column names to pandas dtype strings
parse_dates | list[str] | List of 2 column names to parse as datetime
url | str | Fully resolved URL to the CSV.gz file on GitHub Releases, e.g., https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz

Usage Examples

Using the dtype and parse_dates with pd.read_csv:

import pandas as pd

dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}

parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

# No interpolation is needed here, so a plain string literal is used.
url = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz'

df = pd.read_csv(url, dtype=dtype, parse_dates=parse_dates)
print(df.dtypes)

Constructing the URL for different months:

year = 2021
month = 7

prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'
url = f'{prefix}/yellow_tripdata_{year}-{month:02d}.csv.gz'
# Result: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-07.csv.gz
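The single-month construction above extends naturally to several months. The sketch below wraps the same template in a function; the name build_url and the 1-12 range check are assumptions added for illustration and are not part of ingest_data.py.

```python
prefix = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow"


def build_url(year: int, month: int) -> str:
    """Hypothetical helper: fill the URL template for one monthly partition."""
    # Assumed guard, not in the original script: reject impossible months early.
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    # :02d zero-pads the month so that 7 becomes "07".
    return f"{prefix}/yellow_tripdata_{year}-{month:02d}.csv.gz"


# URLs for the first quarter of 2021.
urls = [build_url(2021, month) for month in range(1, 4)]
for url in urls:
    print(url)
```

This only builds the strings; whether a given month exists in the release is not checked here.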
