Heuristic:MaterializeInc Materialize Docker Image Cache Lookup
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization |
| Last Updated | 2026-02-08 21:00 GMT |
Overview
Multi-tier Docker image existence check strategy that avoids Docker Hub rate limits by preferring API lookups over CLI commands, with local caching via a `known-docker-images.txt` file.
Description
The `is_docker_image_pushed()` and related functions in `mzbuild.py` implement a three-tier strategy for checking whether a Docker image exists in a remote registry: (1) check a local file cache (`known-docker-images.txt`); (2) query the Docker Hub REST API or the GHCR token-based API; (3) fall back to the `docker manifest inspect` CLI command. The file cache persists across builds within the same workspace, preventing redundant network calls. The API-first ordering matters because `docker manifest inspect` counts against the Docker Hub rate limit even when the image does not exist.
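The ordering of the three tiers can be sketched as a small function. This is a minimal illustration, not the code in `mzbuild.py`: the name `image_exists` and the injected `api_check` and `cli_check` callables are hypothetical, and the convention that the API check returns `None` when it cannot give a definitive answer is an assumption made for this sketch.

```python
from typing import Callable, Optional


def image_exists(
    name: str,
    known: set[str],
    api_check: Callable[[str], Optional[bool]],
    cli_check: Callable[[str], bool],
) -> bool:
    """Return True if `name` exists, consulting the cheapest source first."""
    # Tier 1: images already known to exist need no network call at all.
    if name in known:
        return True

    # Tier 2: ask the registry HTTP API. None means "no definitive answer"
    # (an assumed convention for this sketch).
    exists = api_check(name)

    # Tier 3: use the slower CLI check only when the API was inconclusive.
    if exists is None:
        exists = cli_check(name)

    # Record positive results only; a missing image may be pushed later.
    if exists:
        known.add(name)
    return exists
```

Passing the checks in as callables keeps the ordering logic separate from the network code, so the tiering can be exercised without touching a registry.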
Usage
Apply this heuristic when debugging image lookup failures, understanding build cache behavior, or optimizing Docker image build times. It is central to the Mz_Image_Tag_Exists, ResolvedImage_Build, and ResolvedImage_Fingerprint implementations.
The Insight (Rule of Thumb)
- Action: Always check the local cache first, then use the HTTP API, then fall back to the Docker CLI.
- Value: Avoids Docker Hub rate limits (which count `docker manifest inspect` against the pull quota) and reduces network round trips.
- Trade-off: The file cache can become stale if an image is deleted from the registry. However, such false positives (the cache reports an image that has since been deleted) are rare in practice because images are immutable once published.
- Registry-specific: Docker Hub uses the REST API (`hub.docker.com/v2/repositories/`), while GHCR uses the OCI token endpoint (`ghcr.io/token` + HEAD request to `/v2/.../manifests/`).
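The registry-specific difference can be made concrete by mapping an image reference to the endpoints that would be queried. The sketch below only builds URLs and performs no network I/O. `LookupPlan` and `plan_lookup` are hypothetical names, the `scope=repository:<path>:pull` query string on the GHCR token URL is an assumption about how an anonymous pull token is requested, and `ghcr.io/example-org/example-image` is a placeholder repository.

```python
from dataclasses import dataclass
from typing import Optional

DOCKER_HUB_API = "https://hub.docker.com/v2/repositories"
GHCR_HOST = "ghcr.io"


@dataclass(frozen=True)
class LookupPlan:
    """Endpoints needed to check one image (hypothetical helper type)."""

    token_url: Optional[str]  # None when no token exchange is needed
    check_url: str
    method: str  # HTTP method to use against check_url


def plan_lookup(image: str) -> LookupPlan:
    """Map 'repo:tag' or 'ghcr.io/repo:tag' to the URLs to query."""
    repo, sep, tag = image.rpartition(":")
    if not sep or not repo or not tag or "/" in tag:
        raise ValueError(f"expected '<repository>:<tag>', got {image!r}")

    prefix = GHCR_HOST + "/"
    if repo.startswith(prefix):
        path = repo[len(prefix):]
        return LookupPlan(
            # Assumed query format for the anonymous pull-token request.
            token_url=f"https://{GHCR_HOST}/token?scope=repository:{path}:pull",
            check_url=f"https://{GHCR_HOST}/v2/{path}/manifests/{tag}",
            method="HEAD",
        )

    # Anything else is treated as a Docker Hub repository in this sketch.
    return LookupPlan(
        token_url=None,
        check_url=f"{DOCKER_HUB_API}/{repo}/tags/{tag}",
        method="GET",
    )
```

For example, `plan_lookup("materialize/materialized:v1.0")` yields a single `GET` against the Docker Hub tags endpoint, while a `ghcr.io/...` reference yields a token URL plus a `HEAD` request against the manifest endpoint.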
Reasoning
Docker Hub rate limits are a significant operational concern for CI systems that check many images per pipeline run. The `docker manifest inspect` command counts against pull rate limits even when the image doesn't exist, making it unsuitable for frequent existence checks. The Docker Hub API (`/v2/repositories/.../tags/`) and GHCR API (`/v2/.../manifests/`) provide a lighter-weight alternative. The local file cache eliminates repeated checks for images that are known to exist from previous steps in the same build.
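A file-backed cache of this kind can be sketched in a few lines. The class below is an illustration under stated assumptions, not the implementation in `mzbuild.py`: the name `KnownImages` is hypothetical, and the one-image-name-per-line file layout is assumed rather than taken from the source.

```python
from pathlib import Path
from typing import Optional


class KnownImages:
    """Append-only record of image names known to exist (hypothetical helper)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._names: Optional[set[str]] = None  # loaded lazily on first use

    def _load(self) -> set[str]:
        if self._names is None:
            try:
                # Assumed layout: one image name per line.
                self._names = set(self.path.read_text().splitlines())
            except FileNotFoundError:
                self._names = set()
        return self._names

    def __contains__(self, name: str) -> bool:
        return name in self._load()

    def add(self, name: str) -> None:
        names = self._load()
        if name not in names:
            names.add(name)
            # Append so that entries written by earlier steps are preserved.
            with self.path.open("a") as f:
                f.write(name + "\n")
```

Only positive results are recorded, which matches the stale-entry trade-off noted above: a name can linger after the image is deleted, but a missing image is never cached as missing.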
The fallback to the CLI (`docker manifest inspect`) handles cases where the API cannot give a definitive answer: authentication failures (HTTP 401), rate limiting (429), server-side errors (500, 502, 503, 504), and non-standard registries.
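One way to express that decision is a function that turns an HTTP status into a three-valued answer. The set of fallback statuses is taken from the list above; treating 200 as "exists" and 404 as "missing", and raising on any other status, are assumptions made for this sketch, and `interpret_status` is a hypothetical name.

```python
from typing import Optional

# Statuses for which the API result is inconclusive and the CLI is tried.
FALLBACK_STATUSES = frozenset({401, 429, 500, 502, 503, 504})


def interpret_status(status_code: int) -> Optional[bool]:
    """Translate an API status into exists (True), missing (False), or unknown (None)."""
    if status_code == 200:  # assumed to mean the tag or manifest exists
        return True
    if status_code == 404:  # assumed to mean it does not exist
        return False
    if status_code in FALLBACK_STATUSES:
        return None  # inconclusive: the caller should fall back to the CLI
    # Anything else is unexpected in this sketch, so fail loudly.
    raise ValueError(f"unexpected registry API status: {status_code}")
```

Returning `None` instead of `False` for the fallback statuses keeps "the API could not answer" distinct from "the image does not exist", so a rate-limited request is never mistaken for a missing image.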
Code Evidence
Multi-tier lookup from `docker.py:67-113`:
def mz_image_tag_exists(image_tag: str, quiet: bool = False) -> bool:
    image_name = f"{image_registry()}/materialized:{image_tag}"
    # Tier 1: In-memory cache
    if image_name in EXISTENCE_OF_IMAGE_NAMES_FROM_EARLIER_CHECK:
        return EXISTENCE_OF_IMAGE_NAMES_FROM_EARLIER_CHECK[image_name]
    # Tier 2: Local Docker images
    output = subprocess.check_output(
        ["docker", "images", "--quiet", image_name], ...
    )
    if output:
        EXISTENCE_OF_IMAGE_NAMES_FROM_EARLIER_CHECK[image_name] = True
        return True
    # Tier 3: Docker Hub API (avoids rate limits)
    if image_registry() != "materialize":
        return mz_image_tag_exists_cmdline(image_name)
    response = requests.get(
        f"https://hub.docker.com/v2/repositories/materialize/materialized/tags/{image_tag}"
    )
File-based persistent cache from `mzbuild.py:246-264`:
KNOWN_DOCKER_IMAGES_FILE = Path(MZ_ROOT / "known-docker-images.txt")
_known_docker_images: set[str] | None = None

def is_docker_image_pushed(name: str) -> bool:
    # ... loads from file, checks in-memory set
    if name in _known_docker_images:
        return True
    # ... falls back to API/CLI, appends to file on success
Rate limit fallback from `mzbuild.py:304-312`:
if response.status_code in (401, 429, 500, 502, 503, 504):
    # Fall back to 5x slower method
    proc = subprocess.run(
        ["docker", "manifest", "inspect", name],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        env=dict(os.environ, DOCKER_CLI_EXPERIMENTAL="enabled"),
    )
    exists = proc.returncode == 0