Heuristic:DataTalksClub Data engineering zoomcamp CSV Chunk Size Optimization

From Leeroopedia



Knowledge Sources
Domains Data_Ingestion, Optimization
Last Updated 2026-02-09 07:00 GMT

01-docker-terraform/docker-sql/pipeline/ingest_data.py

Overview

Use a chunk size of 100,000 rows when reading large CSV files with pandas to balance memory usage against database insert throughput during data ingestion.

Description

When ingesting large CSV files (millions of rows) into PostgreSQL using pandas, reading the entire file into memory at once can exhaust available RAM and cause out-of-memory errors. The chunked reading pattern uses `pd.read_csv(..., iterator=True, chunksize=N)` to obtain an iterator that yields N rows at a time, writing each chunk to the database before loading the next. A chunk size of 100,000 rows is the default used throughout the repository for NYC taxi data ingestion.
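The pattern can be sketched end-to-end with an in-memory SQLite database standing in for PostgreSQL (pandas' `to_sql` accepts a raw `sqlite3` connection). The table and column names below are illustrative, not from the ingestion script:

```python
import io
import sqlite3

import pandas as pd

# Illustrative CSV with 250 rows; in practice this would be a file or URL.
csv = io.StringIO("ride_id,fare\n" + "\n".join(f"{i},{i * 0.5}" for i in range(250)))
conn = sqlite3.connect(":memory:")

first = True
for chunk in pd.read_csv(csv, iterator=True, chunksize=100):
    if first:
        # Create (or replace) the table schema from an empty frame first,
        # so every chunk, including the first, can simply be appended.
        chunk.head(0).to_sql("rides", conn, if_exists="replace", index=False)
        first = False
    chunk.to_sql("rides", conn, if_exists="append", index=False)

count = conn.execute("SELECT COUNT(*) FROM rides").fetchone()[0]
print(count)  # 250 rows ingested across three chunks of 100, 100, and 50
```

Only one chunk (at most 100 rows here) is resident in memory at any point, which is the property the 100,000-row default relies on at scale.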

Usage

Use this heuristic when loading large CSV files into a database using pandas. Apply the 100,000-row chunk size as a starting default, and adjust based on available memory and row width. Narrower rows (fewer columns) can support larger chunk sizes; wider rows need smaller chunks.

The Insight (Rule of Thumb)

  • Action: Set `chunksize=100000` when calling `pd.read_csv()` with `iterator=True`.
  • Value: 100,000 rows per chunk (default in the ingestion script).
  • Trade-off: Smaller chunks use less memory but incur more database round-trips. Larger chunks are faster but require more RAM. 100,000 is a pragmatic default for taxi data with ~18 columns.

Reasoning

NYC taxi data files contain millions of rows (e.g., yellow taxi 2021-01 has ~1.4M rows). Loading all rows into a single pandas DataFrame would require several GB of RAM. By using chunked iteration, peak memory stays bounded by `chunksize * row_size`. Each chunk is written to PostgreSQL via `df.to_sql()` with `if_exists='append'`, enabling progress tracking via tqdm. The 100,000-row default balances the overhead of per-chunk `to_sql()` calls against memory safety.
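The memory bound above can be verified directly: however many rows the file holds, no chunk yielded by the iterator ever exceeds `chunksize` rows, so peak DataFrame memory stays proportional to `chunksize * row_size`. A minimal check, with a small synthetic file:

```python
import io

import pandas as pd

# 1,000-row synthetic CSV standing in for a multi-million-row taxi file.
csv = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1_000)))

chunksize = 100
max_rows_seen = 0
for chunk in pd.read_csv(csv, iterator=True, chunksize=chunksize):
    # Each chunk is an ordinary DataFrame capped at `chunksize` rows.
    max_rows_seen = max(max_rows_seen, len(chunk))

print(max_rows_seen)  # 100: never exceeds the configured chunksize
```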

Code Evidence

Chunk size configuration from `ingest_data.py:43`:

    @click.option('--chunksize', default=100000, type=int, help='Chunk size for reading CSV')

Chunked reading and writing from `ingest_data.py:51-74`:

    df_iter = pd.read_csv(
        url,
        dtype=dtype,
        parse_dates=parse_dates,
        iterator=True,
        chunksize=chunksize,
    )

    first = True

    for df_chunk in tqdm(df_iter):
        if first:
            df_chunk.head(0).to_sql(
                name=target_table,
                con=engine,
                if_exists='replace'
            )
            first = False

        df_chunk.to_sql(
            name=target_table,
            con=engine,
            if_exists='append'
        )
