Heuristic: DataTalksClub Data Engineering Zoomcamp GCS Upload Timeout Workaround
| Knowledge Sources | |
|---|---|
| Domains | Cloud_Infrastructure, Debugging |
| Last Updated | 2026-02-09 07:00 GMT |
Source files:
- cohorts/2022/week_2_data_ingestion/airflow/dags/data_ingestion_gcs_dag.py
- 03-data-warehouse/extras/web_to_gcs.py
Overview
Workaround for Google Cloud Storage upload timeouts on files larger than 6 MB: reduce the multipart upload threshold and chunk size to 5 MB so that uploads succeed even on slow connections (~800 kbps upload speed).
Description
The Google Cloud Storage Python client library ships with default multipart-upload threshold and chunk-size settings that can cause timeouts on slow network connections. When uploading files larger than 6 MB at speeds around 800 kbps, the defaults cause the upload to time out before a chunk completes. The workaround overrides two internal constants in the `google.cloud.storage.blob` module so that 5 MB chunks are used instead of the defaults.
Usage
Use this heuristic when uploading files to GCS and experiencing timeout errors, particularly on connections with upload speeds below 1 Mbps. This is a known issue documented in the `googleapis/python-storage` GitHub repository (issue #74).
The Insight (Rule of Thumb)
- Action: Override the GCS blob module's internal multipart and chunk size constants before uploading.
- Value: Set both `_MAX_MULTIPART_SIZE` and `_DEFAULT_CHUNKSIZE` to `5 * 1024 * 1024` (5 MB).
- Trade-off: Smaller chunks mean more HTTP requests per upload but prevent timeouts on slow connections. On fast connections, this is unnecessary overhead.
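As a sketch of how the override can be packaged, the helper below sets both constants on whatever module object it receives. The helper name `apply_chunked_upload_workaround` and the stand-in namespace are illustrative, not part of the source; in real code you would pass `google.cloud.storage.blob` itself:

```python
from types import SimpleNamespace

# 5 MB: small enough that one chunk finishes within the client's
# per-request timeout even on a ~800 kbps uplink.
FIVE_MB = 5 * 1024 * 1024

def apply_chunked_upload_workaround(blob_module):
    """Force chunked uploads with 5 MB chunks on the given blob module.

    With the real client library this would be called as:
        from google.cloud import storage
        apply_chunked_upload_workaround(storage.blob)
    """
    blob_module._MAX_MULTIPART_SIZE = FIVE_MB  # threshold above which uploads are chunked
    blob_module._DEFAULT_CHUNKSIZE = FIVE_MB   # size of each upload chunk

# Demonstration against a stand-in namespace (no GCS dependency needed here).
fake_blob_module = SimpleNamespace()
apply_chunked_upload_workaround(fake_blob_module)
print(fake_blob_module._MAX_MULTIPART_SIZE)  # 5242880
```

Wrapping the two assignments in a function makes it easy to apply the override once, before any upload code runs, in every script that needs it.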
Reasoning
The default `_MAX_MULTIPART_SIZE` in the GCS Python client is larger than 5 MB, so the library attempts single-request uploads for medium-sized files. On slow connections (~800 kbps), these single requests time out before the data is fully transferred. Forcing 5 MB chunks ensures each individual HTTP request completes within the timeout window. The workaround was documented in a GitHub issue (googleapis/python-storage#74) and has been adopted by the community as a pragmatic fix.
At 800 kbps upload speed, uploading a full month of taxi data can take approximately 20 minutes.
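A quick back-of-the-envelope check (assuming "800 kbps" means 800,000 bits per second) shows why a 5 MB chunk fits comfortably inside a typical request timeout, and roughly how much data a 20-minute upload implies:

```python
UPLINK_BPS = 800_000           # 800 kbps upload speed
CHUNK_BYTES = 5 * 1024 * 1024  # the 5 MB chunk size from the workaround

# Time to push one 5 MB chunk over the slow uplink.
seconds_per_chunk = CHUNK_BYTES * 8 / UPLINK_BPS
print(round(seconds_per_chunk))  # 52 seconds per chunk

# Total bytes a 20-minute upload moves at this speed.
total_bytes = 20 * 60 * UPLINK_BPS / 8
print(total_bytes / 1e6)  # 120.0 (MB), a plausible size for a month of taxi data
```

At roughly 52 seconds per chunk, each request finishes well before the client's default timeout, whereas a single 120 MB request would need ~20 minutes of uninterrupted transfer.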
Code Evidence
Workaround from `data_ingestion_gcs_dag.py:41-45`:

```python
# WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.
# (Ref: https://github.com/googleapis/python-storage/issues/74)
storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB
```
Upload speed note from `data_ingestion_gcs_dag.py:32`:

```python
# NOTE: takes 20 mins, at an upload speed of 800kbps. Faster if your internet has a better upload speed
```
Same workaround from `web_to_gcs.py:24-27`:

```python
# WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.
# (Ref: https://github.com/googleapis/python-storage/issues/74)
storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB
```