Principle:Duckdb Duckdb Source Amalgamation
Overview
Combining multiple source files into a single compilation unit for simplified distribution and compilation. Source amalgamation is the technique by which DuckDB's hundreds of .cpp and .hpp files are merged into a single duckdb.cpp source file and a single duckdb.hpp header file, enabling end users to compile DuckDB without a build system.
Description
The amalgamation technique concatenates all DuckDB source files into two files:
- duckdb.cpp -- a single C++ source file containing all implementation code
- duckdb.hpp -- a single C++ header file containing all declarations
This approach provides several critical benefits:
- Simplified Distribution
- Instead of shipping hundreds of source files with a complex directory structure, DuckDB can be distributed as just two files. End users can add these two files to any project and compile directly.
- Single-Compilation-Unit Optimizations
- When the compiler sees the entire codebase in a single translation unit, it can perform whole-program optimization more effectively. Inlining decisions, dead code elimination, and interprocedural analysis all benefit from full visibility.
- Elimination of Build System Requirements
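- End users embedding DuckDB do not need CMake, Make, or any other build system. A single compiler invocation that builds the application source (here a hypothetical main.cpp that includes duckdb.hpp) together with duckdb.cpp suffices:
g++ -std=c++17 -O2 -o my_app main.cpp duckdb.cpp -lpthread -ldl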
- Header Deduplication
- The amalgamation process tracks which headers have already been included and skips duplicate #include directives. This prevents multiple-definition errors and reduces the final file size.
How It Works
The amalgamation process follows these steps:
- Discover source files by parsing src/CMakeLists.txt recursively to find all .cpp files.
- Resolve include ordering by following #include directives depth-first, tracking which files have already been written.
- Concatenate source files in dependency order, replacing #include "..." with the actual file contents (for project-internal includes) and preserving #include <...> for system headers.
- Write the output to src/amalgamation/duckdb.cpp and src/amalgamation/duckdb.hpp.
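The resolve-and-concatenate steps can be illustrated with a minimal sketch. It is not DuckDB's actual script: the `amalgamate` function, the in-memory `files` mapping, and the file names are hypothetical stand-ins for the real source tree, and the regular expression only handles includes that sit alone on a line.

```python
import re

# Matches project-internal includes such as: #include "catalog.hpp"
LOCAL_INCLUDE = re.compile(r'^\s*#include\s+"([^"]+)"\s*$')


def amalgamate(entry_points, files):
    """Inline project-internal includes depth-first, emitting each file once.

    `files` is a hypothetical mapping of path -> source text that stands in
    for the source tree on disk.
    """
    written = set()  # files already emitted (the deduplication step)
    output = []

    def emit(path):
        if path in written:
            return  # duplicate include: skip it entirely
        written.add(path)  # mark before recursing so include cycles terminate
        for line in files[path].splitlines():
            match = LOCAL_INCLUDE.match(line)
            if match and match.group(1) in files:
                emit(match.group(1))  # replace the directive with the file body
            else:
                output.append(line)  # ordinary code, or #include <...>

    for path in entry_points:
        emit(path)
    return "\n".join(output) + "\n"


# Hypothetical three-file project used only for illustration.
files = {
    "types.hpp": "#include <cstdint>\nstruct Value { int64_t v; };",
    "catalog.hpp": '#include "types.hpp"\nstruct Catalog { Value root; };',
    "main.cpp": '#include "catalog.hpp"\n#include "types.hpp"\nCatalog c;',
}
result = amalgamate(["main.cpp"], files)
print(result)
```

In this example `types.hpp` is reached twice (through `catalog.hpp` and directly from `main.cpp`) but is written only once, and the system include `#include <cstdint>` is passed through unchanged.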
Extended Mode
The extended amalgamation (--extended flag) includes additional modules beyond the core:
- Parquet reader/writer -- the Apache Parquet extension
- jemalloc allocator -- the jemalloc memory allocator for improved performance
Split Mode
For build systems that benefit from parallel compilation, the amalgamation can be split into N separate source files (--splits N), each containing a subset of the source. This preserves the distribution simplicity while enabling parallel builds.
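One way to picture the split is as a partition of the ordered source list into N buckets, each of which becomes its own output file. The sketch below assumes simple contiguous chunking by file count; the `split_sources` helper and the file names are hypothetical, and the real script may balance the splits differently.

```python
def split_sources(paths, splits):
    """Distribute source paths across `splits` buckets, keeping their order."""
    if splits < 1:
        raise ValueError("splits must be at least 1")
    size, extra = divmod(len(paths), splits)
    buckets, start = [], 0
    for i in range(splits):
        # The first `extra` buckets take one additional file each.
        end = start + size + (1 if i < extra else 0)
        buckets.append(paths[start:end])
        start = end
    return buckets


# Hypothetical file names used only for illustration.
sources = ["a.cpp", "b.cpp", "c.cpp", "d.cpp", "e.cpp"]
buckets = split_sources(sources, 2)
print(buckets)
```

Each bucket can then be compiled as an independent translation unit, so a build system can compile the buckets in parallel.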
Usage
This principle applies when:
- Creating distributable source packages for embedding DuckDB in other projects (e.g., Python bindings, R packages, Node.js addons)
- Preparing release artifacts that will be uploaded to GitHub Releases or package registries
- Building single-file source distributions where duckdb.cpp and duckdb.hpp are included directly in another project's source tree
- Optimizing compilation through unity build techniques in CI pipelines
Theoretical Basis
| Concept | Description |
|---|---|
| Single Compilation Unit (SCU) | A technique where all source files are combined into one translation unit, enabling the compiler to see and optimize the entire program at once. |
| Include Resolution and Ordering | Topological sorting of header dependencies to ensure each header is included exactly once, in the correct order relative to its dependents. |
| Header Deduplication | Tracking already-included headers to prevent duplicate definitions, analogous to #pragma once or include guards but applied at the amalgamation level. |
| Unity Builds | A build technique (used in game engines and large C++ projects) where multiple source files are #include-d into a single file to reduce build times and enable cross-TU optimizations. |
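As a small illustration of the include-ordering concept, the sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) on a hypothetical dependency graph. The header names are invented for the example, and this shows the general idea of topological ordering rather than how DuckDB's script is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each header maps to the headers it includes.
includes = {
    "connection.hpp": {"catalog.hpp", "types.hpp"},
    "catalog.hpp": {"types.hpp"},
    "types.hpp": set(),
}

# static_order() yields every node after all of its dependencies,
# so each header appears exactly once and before anything that needs it.
order = list(TopologicalSorter(includes).static_order())
print(order)
```

Writing the headers in this order guarantees that every declaration is visible before its first use, which is the same property the depth-first traversal with a "written" set provides.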