Heuristic:DataTalksClub Data engineering zoomcamp CSV Chunk Size Optimization

From Leeroopedia



Knowledge Sources
Domains Data_Ingestion, Optimization
Last Updated 2026-02-09 07:00 GMT

01-docker-terraform/docker-sql/pipeline/ingest_data.py

Overview

Use a chunk size of 100,000 rows when reading large CSV files with pandas to balance memory usage against database insert throughput during data ingestion.

Description

When ingesting large CSV files (millions of rows) into PostgreSQL using pandas, reading the entire file into memory at once can exhaust available RAM and cause out-of-memory errors. The chunked reading pattern uses `pd.read_csv(..., iterator=True, chunksize=N)` to obtain an iterator that yields N rows at a time, writing each chunk to the database before loading the next. A chunk size of 100,000 rows is the default used throughout the repository for NYC taxi data ingestion.
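The pattern can be sketched end-to-end with an in-memory SQLite database standing in for PostgreSQL (pandas' `to_sql` accepts a raw `sqlite3` connection). The table and column names below are illustrative, not from the ingestion script:

```python
import io
import sqlite3

import pandas as pd

# Illustrative CSV with 250 rows; in practice this would be a file or URL.
csv = io.StringIO("ride_id,fare\n" + "\n".join(f"{i},{i * 0.5}" for i in range(250)))
conn = sqlite3.connect(":memory:")

first = True
for chunk in pd.read_csv(csv, iterator=True, chunksize=100):
    if first:
        # Create (or replace) the table schema from an empty frame first,
        # so every chunk, including the first, can simply be appended.
        chunk.head(0).to_sql("rides", conn, if_exists="replace", index=False)
        first = False
    chunk.to_sql("rides", conn, if_exists="append", index=False)

count = conn.execute("SELECT COUNT(*) FROM rides").fetchone()[0]
print(count)  # 250 rows ingested across three chunks of 100, 100, and 50
```

Only one chunk (at most 100 rows here) is resident in memory at any point, which is the property the 100,000-row default relies on at scale.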

Usage

Use this heuristic when loading large CSV files into a database using pandas. Apply the 100,000-row chunk size as a starting default, and adjust based on available memory and row width. Narrower rows (fewer columns) can support larger chunk sizes; wider rows need smaller chunks.

The Insight (Rule of Thumb)

  • Action: Set `chunksize=100000` when calling `pd.read_csv()` with `iterator=True`.
  • Value: 100,000 rows per chunk (default in the ingestion script).
  • Trade-off: Smaller chunks use less memory but incur more database round-trips. Larger chunks are faster but require more RAM. 100,000 is a pragmatic default for taxi data with ~18 columns.

Reasoning

NYC taxi data files contain millions of rows (e.g., yellow taxi 2021-01 has ~1.4M rows). Loading all rows into a single pandas DataFrame would require several GB of RAM. By using chunked iteration, peak memory stays bounded by `chunksize * row_size`. Each chunk is written to PostgreSQL via `df.to_sql()` with `if_exists='append'`, enabling progress tracking via tqdm. The 100,000-row default balances the overhead of per-chunk `to_sql()` calls against memory safety.
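The memory bound above can be verified directly: however many rows the file holds, no chunk yielded by the iterator ever exceeds `chunksize` rows, so peak DataFrame memory stays proportional to `chunksize * row_size`. A minimal check, with a small synthetic file:

```python
import io

import pandas as pd

# 1,000-row synthetic CSV standing in for a multi-million-row taxi file.
csv = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1_000)))

chunksize = 100
max_rows_seen = 0
for chunk in pd.read_csv(csv, iterator=True, chunksize=chunksize):
    # Each chunk is an ordinary DataFrame capped at `chunksize` rows.
    max_rows_seen = max(max_rows_seen, len(chunk))

print(max_rows_seen)  # 100: never exceeds the configured chunksize
```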

Code Evidence

Chunk size configuration from `ingest_data.py:43`:

    @click.option('--chunksize', default=100000, type=int, help='Chunk size for reading CSV')

Chunked reading and writing from `ingest_data.py:51-74`:

    df_iter = pd.read_csv(
        url,
        dtype=dtype,
        parse_dates=parse_dates,
        iterator=True,
        chunksize=chunksize,
    )

    first = True

    for df_chunk in tqdm(df_iter):
        if first:
            df_chunk.head(0).to_sql(
                name=target_table,
                con=engine,
                if_exists='replace'
            )
            first = False

        df_chunk.to_sql(
            name=target_table,
            con=engine,
            if_exists='append'
        )
