
Heuristic:DataTalksClub Data engineering zoomcamp GCS Upload Timeout Workaround

From Leeroopedia




Knowledge Sources
Domains Cloud_Infrastructure, Debugging
Last Updated 2026-02-09 07:00 GMT

cohorts/2022/week_2_data_ingestion/airflow/dags/data_ingestion_gcs_dag.py
03-data-warehouse/extras/web_to_gcs.py

Overview

A workaround for Google Cloud Storage upload timeouts on files larger than 6 MB: reduce the client's multipart upload threshold and chunk size to 5 MB, preventing failures on slow connections (around 800 kbps upload speed).

Description

The Google Cloud Storage Python client library has a default multipart upload threshold and chunk size that can cause timeouts on slow network connections. When uploading files larger than 6 MB at speeds around 800 kbps, the default settings cause the upload to time out before a chunk completes. The workaround overrides two internal constants in the `google.cloud.storage.blob` module to use 5 MB chunks instead of the defaults.
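A minimal sketch of the workaround in context, assuming `google-cloud-storage` is installed and credentials are configured. The function name and its parameters are illustrative, not from the course repository; the two patched constants are the ones the course files override.

```python
def upload_to_gcs(bucket_name: str, object_name: str, local_file: str) -> None:
    """Illustrative helper: upload a local file with 5 MB chunking forced."""
    from google.cloud import storage  # imported lazily for illustration

    # The workaround: patch the client's private constants so files above
    # 5 MB are always sent as chunked (resumable) uploads in 5 MB pieces,
    # instead of one large single-request upload that may time out.
    storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
    storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024   # 5 MB

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(object_name).upload_from_filename(local_file)
```

Because the constants are private (underscore-prefixed), this patch may break on a future library release; pin the client version if you rely on it.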

Usage

Use this heuristic when uploading files to GCS and experiencing timeout errors, particularly on connections with upload speeds below 1 Mbps. This is a known issue documented in the `googleapis/python-storage` GitHub repository (issue #74).

The Insight (Rule of Thumb)

  • Action: Override the GCS blob module's internal multipart and chunk size constants before uploading.
  • Value: Set both `_MAX_MULTIPART_SIZE` and `_DEFAULT_CHUNKSIZE` to `5 * 1024 * 1024` (5 MB).
  • Trade-off: Smaller chunks mean more HTTP requests per upload but prevent timeouts on slow connections. On fast connections, this is unnecessary overhead.
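As an alternative to patching private constants, the client library also exposes a public `chunk_size` argument on `Blob`, which achieves per-blob chunking without a module-wide monkeypatch. This sketch is my suggestion, not what the course code does; `chunk_size` must be a multiple of 256 KB, which 5 MB satisfies (20 × 256 KB).

```python
FIVE_MB = 5 * 1024 * 1024  # must be a multiple of 256 KB for chunk_size

def upload_with_explicit_chunks(bucket_name: str, object_name: str, local_file: str) -> None:
    """Illustrative helper: per-blob 5 MB chunking via the public API."""
    from google.cloud import storage  # lazy import, illustration only

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name, chunk_size=FIVE_MB)
    blob.upload_from_filename(local_file)
```

The public parameter is scoped to one blob, whereas the monkeypatch changes behavior for every upload in the process.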

Reasoning

The default `_MAX_MULTIPART_SIZE` in the GCS Python client is larger than 5MB, causing the library to attempt single-request uploads for medium-sized files. On slow connections (~800 kbps), these single requests time out before the data is fully transferred. By forcing 5MB chunks, each individual HTTP request completes within the timeout window. The workaround was documented in a GitHub issue (googleapis/python-storage#74) and has been adopted by the community as a pragmatic fix.

At 800 kbps upload speed, uploading a full month of taxi data can take approximately 20 minutes.
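The per-chunk timing can be checked with back-of-the-envelope arithmetic (mine, not from the source files; only the 800 kbps figure comes from the course comments): each 5 MB chunk needs roughly 52 seconds at that speed, which fits inside a typical request timeout, while one single-request upload of a whole file would not.

```python
# How long one 5 MB chunk takes at an 800 kbps upload speed.
CHUNK_BYTES = 5 * 1024 * 1024   # 5 MB chunk size from the workaround
UPLOAD_BITS_PER_SEC = 800_000   # 800 kbps, per the course comment

seconds_per_chunk = CHUNK_BYTES * 8 / UPLOAD_BITS_PER_SEC
print(f"{seconds_per_chunk:.1f} s per 5 MB chunk")  # prints "52.4 s per 5 MB chunk"
```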

Code Evidence

Workaround from `data_ingestion_gcs_dag.py:41-45`:

# WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.
# (Ref: https://github.com/googleapis/python-storage/issues/74)
storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB

Upload speed note from `data_ingestion_gcs_dag.py:32`:

# NOTE: takes 20 mins, at an upload speed of 800kbps. Faster if your internet has a better upload speed

The same four workaround lines appear verbatim in `web_to_gcs.py:24-27`.
