Principle:ClickHouse ClickHouse Submodule Initialization

From Leeroopedia


Knowledge Sources
Domains Build_System, C++, Version_Control
Last Updated 2026-02-08 00:00 GMT

Overview

Git submodule management is a technique for embedding external source repositories inside a parent repository, so that a large C++ project can vendor every third-party dependency and pin each one to an exact commit.

Description

Large-scale C++ projects such as ClickHouse depend on dozens to hundreds of third-party libraries. Rather than relying on system-installed packages (which vary across distributions and versions), or on package managers that may not cover all needed libraries, the project vendors dependencies by including their full source code as Git submodules inside a contrib/ directory.

Each submodule is pinned to a specific commit SHA1 in the parent repository's tree. This guarantees reproducible builds: every developer and CI agent builds against the exact same version of every dependency, regardless of upstream changes. The parent repository's .gitmodules file records the URL and path for each submodule, while the parent tree itself records the pinned commit.

Because ClickHouse vendors roughly 90 or more libraries, a full recursive clone with complete history would be prohibitively large and slow. Shallow cloning addresses this by fetching only a single commit of history per submodule (--depth=1), and single-branch mode (--single-branch) restricts each clone to one branch instead of fetching refs for every upstream branch. Together these optimizations can reduce clone times from tens of minutes to a few minutes.

A further concern is that many vendored libraries ship their own CMakeLists.txt files and .cmake modules. If these were left in place, CMake could inadvertently pick them up and conflict with the project's own CMake wrappers. The solution is to delete all CMake files from submodule directories after checkout, except in specific submodules whose native build systems are intentionally used (such as llvm-project, corrosion, and rust_vendor).

Usage

Use submodule-based dependency vendoring when:

  • The project needs exact version pinning for reproducibility across all build environments.
  • System package managers do not provide the required library versions or build configurations.
  • The project applies custom patches or build wrappers to third-party code.
  • CI pipelines require fast, deterministic dependency setup without network-dependent package resolution.

This technique is most appropriate for projects that compile dependencies from source and need full control over compiler flags, optimization levels, and build-time configuration of each dependency.

Theoretical Basis

Git submodules work by storing a gitlink entry in the parent repository's tree object. A gitlink is a special tree entry (file mode 160000) whose object ID is the commit SHA1 the submodule is pinned to, rather than the ID of a blob or subtree. The .gitmodules file, tracked as a regular file, provides the mapping from each submodule path to its remote URL.
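The gitlink mechanism can be inspected directly. The following is a minimal sketch, not taken from ClickHouse: it builds a throwaway repository, records a gitlink with git update-index, and prints the resulting tree entry. The path contrib/demo-lib, the URL, and the commit ID are all hypothetical placeholders; the commit ID does not need to exist in the parent repository, because the parent only stores the reference.

```shell
#!/bin/sh
set -eu

# Throwaway parent repository (hypothetical demo, not a ClickHouse checkout).
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Hypothetical pinned commit ID: 40 hex digits, assuming the default SHA-1 object format.
pinned=0123456789abcdef0123456789abcdef01234567

# .gitmodules maps the submodule path to its remote URL (placeholder URL).
git config -f .gitmodules submodule.contrib/demo-lib.path contrib/demo-lib
git config -f .gitmodules submodule.contrib/demo-lib.url https://example.invalid/demo-lib.git
git add .gitmodules

# Record the gitlink: mode 160000 marks the entry as a submodule commit.
git update-index --add --cacheinfo 160000,"$pinned",contrib/demo-lib
git -c user.name=demo -c user.email=demo@example.invalid commit -q -m "pin demo-lib"

# The tree entry has type "commit" and carries the pinned ID.
entry=$(git ls-tree HEAD contrib/demo-lib)
echo "$entry"

# The URL lives in .gitmodules, not in the tree entry.
url=$(git config -f .gitmodules --get submodule.contrib/demo-lib.url)
echo "$url"
```

In a real checkout, running git ls-tree HEAD on a directory that contains submodules shows the same kind of entry for each one, which is how the pinned revision of a dependency can be audited without initializing it.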

The initialization and update process follows this sequence:

1. git submodule init       -- Register submodule paths from .gitmodules into .git/config
2. git submodule sync       -- Synchronize remote URLs from .gitmodules to .git/config
3. git submodule update     -- Fetch and checkout each submodule at its pinned commit
   --depth=1                -- Shallow fetch: only one commit of history
   --single-branch          -- Clone only one branch: the remote HEAD or the configured branch
4. Delete CMake files       -- Remove CMakeLists.txt and *.cmake from submodule dirs
                               (excluding *.h.cmake template files)
                               (excluding llvm-project, corrosion, rust_vendor)
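Steps 1 to 3 of the sequence can be tried end to end without network access. The sketch below is an illustration under stated assumptions, not the project's own script: it creates a stand-in upstream library and a parent repository in a temporary directory (all names such as demo-lib are hypothetical), clones the parent over file:// URLs so that --depth is honored, and then runs init, sync, and a shallow single-branch update. It assumes a Git version recent enough to accept --single-branch on git submodule update, and it sets GIT_ALLOW_PROTOCOL=file because newer Git versions refuse file-based submodule clones by default.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)

# Demo-only settings: allow file:// transport and provide a commit identity.
export GIT_ALLOW_PROTOCOL=file
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Stand-in upstream library with two commits of history.
git init -q "$demo/upstream"
echo v1 > "$demo/upstream/lib.c"
git -C "$demo/upstream" add lib.c
git -C "$demo/upstream" commit -q -m "lib v1"
echo v2 > "$demo/upstream/lib.c"
git -C "$demo/upstream" commit -q -am "lib v2"

# Parent repository that vendors the library under contrib/.
git init -q "$demo/parent"
git -C "$demo/parent" submodule add "file://$demo/upstream" contrib/demo-lib >/dev/null 2>&1
git -C "$demo/parent" commit -q -m "vendor demo-lib"

# Fresh clone of the parent: the submodule directory starts out empty.
git clone -q "file://$demo/parent" "$demo/checkout"
cd "$demo/checkout"

git submodule init >/dev/null                  # 1. register submodules in .git/config
git submodule sync >/dev/null                  # 2. copy URLs from .gitmodules
git submodule update --depth=1 --single-branch >/dev/null 2>&1   # 3. shallow checkout

# How much history was fetched, and which commit is checked out?
history_len=$(git -C contrib/demo-lib rev-list --count HEAD)
checked_out=$(git -C contrib/demo-lib rev-parse HEAD)
pinned=$(git rev-parse HEAD:contrib/demo-lib)
echo "commits fetched: $history_len"
echo "checked out: $checked_out (pinned: $pinned)"
```

In this setup the submodule should contain only the pinned commit even though the upstream has two, and the checked-out commit should equal the gitlink recorded in the parent tree.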

The parallel execution model uses xargs --max-procs N to run up to N submodule updates concurrently. Because each submodule is an independent Git repository with its own network fetch, the updates do not depend on one another, and the speedup can approach linear until network bandwidth, disk I/O, or Git server concurrency becomes the bottleneck.
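The fan-out can be sketched as a pipeline that lists submodule paths from .gitmodules and hands each one to xargs. This is a dry run under assumptions: the three submodules and their URLs are hypothetical, echo stands in for the real command so nothing is fetched, and -P is used as the short form of --max-procs so the sketch also runs with non-GNU xargs.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Hypothetical .gitmodules with three submodules under contrib/.
for name in alpha beta gamma; do
    git config -f .gitmodules "submodule.contrib/$name.path" "contrib/$name"
    git config -f .gitmodules "submodule.contrib/$name.url" "https://example.invalid/$name.git"
done

max_procs=4   # upper bound on concurrent updates (the N in --max-procs N)

# List every submodule path, then start one command per path,
# at most $max_procs at a time. `echo` makes this a dry run;
# removing it would run the updates in an initialized checkout.
planned=$(
    git config -f .gitmodules --get-regexp '\.path$' |
        cut -d' ' -f2 |
        xargs -P "$max_procs" -I{} echo git submodule update --depth=1 --single-branch -- {}
)

# Completion order is not deterministic under parallelism, so sort for display.
printf '%s\n' "$planned" | sort
```

Git also offers a built-in --jobs option on git submodule update; the xargs form shown here simply mirrors the approach described above, where each path gets its own independent process.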

The CMake file deletion step uses find to locate files matching CMakeLists.txt or *.cmake (excluding *.h.cmake template headers) and removes them. This ensures that when the parent project's CMake configuration traverses contrib/, it encounters only the project's own CMake wrapper files (stored in directories like contrib/openssl-cmake/) rather than the upstream library's native CMake files.
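A minimal sketch of the deletion step follows, run against a temporary directory rather than a real checkout. It assumes the cleanup walks only the paths registered in .gitmodules, which is one way to leave wrapper directories untouched; the zlib and zlib-cmake names and the file layout are hypothetical stand-ins chosen for illustration.

```shell
#!/bin/sh
set -eu

root=$(mktemp -d)
cd "$root"

# Two hypothetical submodules, each shipping native CMake files and a template header.
for name in zlib llvm-project; do
    git config -f .gitmodules "submodule.contrib/$name.path" "contrib/$name"
    git config -f .gitmodules "submodule.contrib/$name.url" "https://example.invalid/$name.git"
    mkdir -p "contrib/$name"
    touch "contrib/$name/CMakeLists.txt" "contrib/$name/Find.cmake" "contrib/$name/config.h.cmake"
done

# A wrapper directory owned by the parent project; it is not a submodule.
mkdir -p contrib/zlib-cmake
touch contrib/zlib-cmake/CMakeLists.txt

# Walk the submodule paths, skip those whose native build is kept,
# and delete CMake files while preserving *.h.cmake templates.
git config -f .gitmodules --get-regexp '\.path$' |
    cut -d' ' -f2 |
    grep -v -E '^contrib/(llvm-project|corrosion|rust_vendor)$' |
    while IFS= read -r path; do
        find "$path" -type f \
            \( -name 'CMakeLists.txt' -o -name '*.cmake' \) \
            ! -name '*.h.cmake' \
            -delete
    done

# Show what survived the cleanup.
find contrib -type f | sort
```

After the run, contrib/zlib keeps only its config.h.cmake template, while contrib/llvm-project and the contrib/zlib-cmake wrapper are left as they were.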
