Heuristic: DataTalksClub Data Engineering Zoomcamp CSV Chunk Size Optimization
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, Optimization |
| Last Updated | 2026-02-09 07:00 GMT |
01-docker-terraform/docker-sql/pipeline/ingest_data.py
Overview
Use a chunk size of 100,000 rows when reading large CSV files with pandas to balance memory usage against database insert throughput during data ingestion.
Description
When ingesting large CSV files (millions of rows) into PostgreSQL using pandas, reading the entire file into memory at once can cause out-of-memory (OOM) errors. The chunked reading pattern uses `pd.read_csv(..., iterator=True, chunksize=N)` to process N rows at a time, writing each chunk to the database before loading the next. A chunk size of 100,000 rows is the default used throughout the repository for NYC taxi data ingestion.
Usage
Use this heuristic when loading large CSV files into a database using pandas. Apply the 100,000-row chunk size as a starting default, and adjust based on available memory and row width. Narrower rows (fewer columns) can support larger chunk sizes; wider rows need smaller chunks.
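As a back-of-the-envelope check on "adjust based on available memory and row width", the peak footprint of one chunk is roughly rows × columns × bytes per value. A sketch with illustrative numbers (the helper name and the 8-bytes-per-value assumption are mine, not from the repository):

```python
def chunk_memory_mb(rows: int, columns: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size of one pandas chunk, in MiB.

    Assumes every column is a fixed-width 8-byte value (e.g. float64/int64);
    object/string columns will cost more in practice.
    """
    return rows * columns * bytes_per_value / 1024 / 1024

# 100,000 rows of ~18 mostly-numeric columns is on the order of ~14 MiB:
print(round(chunk_memory_mb(100_000, 18), 1))

# A wider file (say 60 columns) roughly triples that, suggesting a smaller chunk:
print(round(chunk_memory_mb(100_000, 60), 1))
```

Real pandas memory use is higher than this estimate (string columns, index, copies made during parsing), so treat it as a lower bound when sizing chunks against available RAM.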
The Insight (Rule of Thumb)
- Action: Set `chunksize=100000` when calling `pd.read_csv()` with `iterator=True`.
- Value: 100,000 rows per chunk (default in the ingestion script).
- Trade-off: Smaller chunks use less memory but incur more database round-trips. Larger chunks are faster but require more RAM. 100,000 is a pragmatic default for taxi data with ~18 columns.
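The trade-off above can also be inverted: given a per-chunk memory budget, solve for the largest chunk size that fits. A sketch with illustrative numbers (the function and the 8-bytes-per-value assumption are mine, not from the repository):

```python
def chunksize_for_budget(budget_mb: float, columns: int, bytes_per_value: int = 8) -> int:
    """Largest chunk (in rows) whose estimated footprint fits the budget.

    Assumes fixed-width 8-byte values per column; string-heavy data
    needs a larger bytes_per_value.
    """
    bytes_per_row = columns * bytes_per_value
    return int(budget_mb * 1024 * 1024 // bytes_per_row)

# With ~18 columns and a 16 MiB per-chunk budget, ~100k rows is about right:
print(chunksize_for_budget(16, 18))
```

This is consistent with the repository's default: 100,000 rows of ~18 columns lands comfortably inside a small double-digit-MiB budget, while leaving headroom for pandas overhead.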
Reasoning
NYC taxi data files contain millions of rows (e.g., yellow taxi 2021-01 has ~1.4M rows). Loading all rows into a single pandas DataFrame would require several GB of RAM. By using chunked iteration, peak memory stays bounded by `chunksize * row_size`. Each chunk is written to PostgreSQL via `df.to_sql()` with `if_exists='append'`, enabling progress tracking via tqdm. The 100,000-row default balances the overhead of per-chunk `to_sql()` calls against memory safety.
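The pattern described above can be sketched end-to-end in a self-contained form. This assumes pandas is installed; an in-memory SQLite connection stands in for the course's PostgreSQL engine, a tiny inline CSV stands in for a multi-million-row taxi file, and `chunksize=4` plays the role of the 100,000-row production default:

```python
import io
import sqlite3

import pandas as pd

# Tiny inline CSV standing in for a large taxi data file.
csv_data = "trip_id,fare\n" + "\n".join(f"{i},{i * 1.5}" for i in range(10))

# In-memory SQLite stands in for the PostgreSQL engine used in the course.
conn = sqlite3.connect(":memory:")

# Iterate the CSV in fixed-size chunks instead of loading it all at once.
df_iter = pd.read_csv(io.StringIO(csv_data), iterator=True, chunksize=4)

first = True
for df_chunk in df_iter:
    if first:
        # Create the table from the first chunk's schema, replacing any old copy.
        df_chunk.head(0).to_sql("trips", conn, if_exists="replace", index=False)
        first = False
    # Append each chunk; peak memory stays bounded by one chunk's footprint.
    df_chunk.to_sql("trips", conn, if_exists="append", index=False)

count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(count)
```

All 10 rows arrive across three chunks (4 + 4 + 2), mirroring how the real script streams millions of rows in 100,000-row batches while `if_exists='append'` accumulates them in the target table.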
Code Evidence
Chunk size configuration from `ingest_data.py:43`:
@click.option('--chunksize', default=100000, type=int, help='Chunk size for reading CSV')
Chunked reading and writing from `ingest_data.py:51-74`:
df_iter = pd.read_csv(
    url,
    dtype=dtype,
    parse_dates=parse_dates,
    iterator=True,
    chunksize=chunksize,
)

first = True
for df_chunk in tqdm(df_iter):
    if first:
        df_chunk.head(0).to_sql(
            name=target_table,
            con=engine,
            if_exists='replace'
        )
        first = False
    df_chunk.to_sql(
        name=target_table,
        con=engine,
        if_exists='append'
    )