Implementation:Duckdb Duckdb Amalgamation Py
Overview
Concrete tool for creating single-file amalgamated DuckDB source distributions. The amalgamation.py script reads all DuckDB source and header files, resolves include dependencies, deduplicates headers, and writes a single duckdb.cpp and duckdb.hpp that can be compiled independently.
Code Reference
- Source Location: scripts/amalgamation.py (lines 1-608)
Key Functions
| Function | Signature | Purpose |
|---|---|---|
| generate_amalgamation | generate_amalgamation(source_file, header_file) | Main entry point: produces a single amalgamated source and header file pair. |
| generate_amalgamation_splits | generate_amalgamation_splits(source_file, header_file, nsplits) | Produces N split source files plus a single header, for parallel compilation. |
| get_includes | get_includes(fpath, text) | Parses a file's text to extract all #include directives, distinguishing project-internal from system includes. |
| write_file | write_file(current_file) | Writes a single source file into the amalgamation output, recursively resolving its includes. |
| need_to_write_file | need_to_write_file(current_file) | Checks whether a file has already been written to the amalgamation (for deduplication). Returns True if the file still needs to be included. |
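How include extraction, deduplication, and recursive writing fit together can be illustrated with a minimal sketch. This is not the script's actual code: the signatures are simplified, and the regex, the in-memory `files` dictionary, and the `written` set are assumptions made for illustration. Only quoted (project-internal) includes are followed; angle-bracket includes are left in place.

```python
import re

# Matches quoted includes such as: #include "duckdb/common/types.hpp"
# Angle-bracket (system) includes are deliberately not matched.
INCLUDE_RE = re.compile(r'^[ \t]*#[ \t]*include[ \t]+"([^"]+)"', re.MULTILINE)

def get_includes(text):
    """Return the quoted include paths found in text, in order."""
    return INCLUDE_RE.findall(text)

def need_to_write_file(path, written):
    """Return True the first time a path is seen, False afterwards."""
    if path in written:
        return False
    written.add(path)
    return True

def write_file(path, files, written, out):
    """Append path to out after its includes, skipping files already written."""
    if not need_to_write_file(path, written):
        return
    text = files[path]
    for include in get_includes(text):
        if include in files:
            write_file(include, files, written, out)
    # Drop the quoted include lines; their content was inlined above.
    out.append(INCLUDE_RE.sub('', text).strip())

# Hypothetical three-file project held in memory.
files = {
    'a.hpp': '#include "common.hpp"\nstruct A {};',
    'b.hpp': '#include "common.hpp"\nstruct B {};',
    'common.hpp': '#include <string>\nstruct Common {};',
}
written, out = set(), []
for name in ('a.hpp', 'b.hpp'):
    write_file(name, files, written, out)
amalgamated = '\n'.join(out)
```

Because a file is marked as written before its includes are visited, a circular include chain terminates instead of recursing forever, and `common.hpp` appears only once even though two headers include it.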
Source Discovery
The script discovers source files by parsing src/CMakeLists.txt recursively. It follows add_subdirectory() directives and collects all .cpp files listed in CMake source lists. Header files are discovered transitively through #include resolution.
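The recursive discovery described above might look roughly like the following sketch. The function name `discover_sources` and both regular expressions are hypothetical and chosen for illustration; the sketch ignores CMake comments, variables, and generator expressions, which a real CMakeLists.txt may contain.

```python
import os
import re
import tempfile

# Assumed patterns: a literal add_subdirectory(<name>) and bare *.cpp tokens.
SUBDIR_RE = re.compile(r'add_subdirectory\(\s*([^\s)]+)\s*\)')
CPP_RE = re.compile(r'[\w./-]+\.cpp')

def discover_sources(directory):
    """Recursively collect .cpp paths named in CMakeLists.txt files."""
    cmake_path = os.path.join(directory, 'CMakeLists.txt')
    if not os.path.isfile(cmake_path):
        return []
    with open(cmake_path) as f:
        text = f.read()
    sources = [os.path.join(directory, name) for name in CPP_RE.findall(text)]
    for subdir in SUBDIR_RE.findall(text):
        sources += discover_sources(os.path.join(directory, subdir))
    return sources

# Demonstration on a throwaway directory tree (not the DuckDB layout).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'common'))
    with open(os.path.join(root, 'CMakeLists.txt'), 'w') as f:
        f.write('add_subdirectory(common)\nadd_library(demo main.cpp)\n')
    with open(os.path.join(root, 'common', 'CMakeLists.txt'), 'w') as f:
        f.write('add_library(demo_common OBJECT types.cpp value.cpp)\n')
    found = [os.path.relpath(p, root) for p in discover_sources(root)]
```

Headers are intentionally absent from the result: as stated above, they are reached later through #include resolution rather than through CMake.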
I/O Contract
Command-Line Interface
python3 scripts/amalgamation.py [OPTIONS]
Options:
  --extended             Include extended modules (parquet, jemalloc)
  --header-only          Generate header-only amalgamation
  --splits N             Split source into N files for parallel compilation
  --linenumbers          Include #line directives for debugging (default)
  --no-linenumbers       Omit #line directives
  --header PATH          Override output header file path
  --source PATH          Override output source file path
  --list-sources         Print list of source files and exit
  --list-objects         Print list of object files and exit
  --includes             Print list of include files and exit
  --include-directories  Print list of include directories and exit
External Dependencies
| Dependency | Purpose |
|---|---|
| python3 (3.7+) | Script runtime |
| os, re, sys (stdlib) | File operations, regex parsing, argument handling |
No third-party Python packages are required.
Inputs
- Source files: all src/**/*.cpp files listed in CMakeLists.txt
- Header files: all src/include/**/*.hpp files reachable through #include resolution
- Third-party headers: headers from third_party/ referenced by DuckDB source
- CMakeLists.txt: src/CMakeLists.txt and all sub-directory CMakeLists.txt files (for source discovery)
Outputs
| Output File | Description |
|---|---|
| src/amalgamation/duckdb.cpp | Single amalgamated C++ source file containing all implementation code |
| src/amalgamation/duckdb.hpp | Single amalgamated C++ header file containing all declarations |
When --splits N is used, the source output becomes duckdb-0.cpp, duckdb-1.cpp, ..., duckdb-(N-1).cpp.
When --extended is used, additional modules (Parquet extension, jemalloc) are included in the amalgamation.
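The split naming scheme can be sketched as follows. The helper names `split_output_names` and `partition` are hypothetical, and the contiguous, near-equal chunking is an assumption: this page does not describe how the script actually distributes source files across splits.

```python
import os

def split_output_names(source_file, nsplits):
    """Return output paths: duckdb.cpp -> duckdb-0.cpp ... duckdb-(N-1).cpp."""
    stem, ext = os.path.splitext(source_file)
    return [f'{stem}-{i}{ext}' for i in range(nsplits)]

def partition(sources, nsplits):
    """Divide sources into nsplits contiguous chunks of near-equal length."""
    base, extra = divmod(len(sources), nsplits)
    chunks, start = [], 0
    for i in range(nsplits):
        # The first `extra` chunks each take one additional file.
        size = base + (1 if i < extra else 0)
        chunks.append(sources[start:start + size])
        start += size
    return chunks

names = split_output_names('src/amalgamation/duckdb.cpp', 3)
chunks = partition(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 3)
```

Splitting by file count is the simplest option; balancing by file size instead would likely give more even compile times, at the cost of a slightly more involved partitioning step.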
Usage Examples
Basic Amalgamation
# Produce duckdb.cpp and duckdb.hpp in src/amalgamation/
python3 scripts/amalgamation.py
Extended Amalgamation (with Parquet and jemalloc)
# Include parquet reader/writer and jemalloc in the amalgamation
python3 scripts/amalgamation.py --extended
Header-Only Amalgamation
# Produce a header-only amalgamation (all code in the header)
python3 scripts/amalgamation.py --header-only
Split Amalgamation for Parallel Compilation
# Split into 4 source files for parallel compilation
python3 scripts/amalgamation.py --splits 4
# Then compile in parallel:
g++ -std=c++17 -O2 -c src/amalgamation/duckdb-0.cpp &
g++ -std=c++17 -O2 -c src/amalgamation/duckdb-1.cpp &
g++ -std=c++17 -O2 -c src/amalgamation/duckdb-2.cpp &
g++ -std=c++17 -O2 -c src/amalgamation/duckdb-3.cpp &
wait
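# Archive the objects into a static library (the amalgamation is library code
# without a main(), so it is not linked into an executable on its own)
ar rcs libduckdb.a duckdb-0.o duckdb-1.o duckdb-2.o duckdb-3.o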
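For an arbitrary number of splits, the parallel compile step could be driven from Python instead of hand-written background jobs. The sketch below is an illustration, not part of the DuckDB repository: `compile_commands` and `compile_all` are hypothetical names, and the compiler flags simply mirror the shell example above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def compile_commands(nsplits, src_dir='src/amalgamation', cxx='g++'):
    """Build one compile command per split, mirroring the shell example."""
    return [
        [cxx, '-std=c++17', '-O2', '-c', f'{src_dir}/duckdb-{i}.cpp',
         '-o', f'duckdb-{i}.o']
        for i in range(nsplits)
    ]

def compile_all(nsplits, jobs=4):
    """Run the compile commands concurrently; raise if any of them fails."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() waits for every command and re-raises the first failure.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True),
                      compile_commands(nsplits)))

commands = compile_commands(4)
```

Calling `compile_all(4)` from the repository root would run the four compilations. Unlike a bare shell `wait`, which returns success regardless of how the background jobs exited, `check=True` makes a failed compilation raise `subprocess.CalledProcessError`.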
Custom Output Paths
# Write amalgamation to custom paths
python3 scripts/amalgamation.py --header /tmp/duckdb.hpp --source /tmp/duckdb.cpp
Listing Source Files (Diagnostic)
# List all source files that would be included in the amalgamation
python3 scripts/amalgamation.py --list-sources
# List all include directories
python3 scripts/amalgamation.py --include-directories
Full Pipeline Integration
# Run generators first, then amalgamate
python3 scripts/generate_c_api.py
python3 scripts/generate_enum.py
python3 scripts/generate_serialization.py
python3 scripts/generate_grammar.py
python3 scripts/generate_functions.py
python3 scripts/generate_settings.py
python3 scripts/generate_metrics.py
# Now amalgamate
python3 scripts/amalgamation.py --extended
# Verify compilation
g++ -std=c++17 -O2 -c src/amalgamation/duckdb.cpp
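The same pipeline can be expressed as a small fail-fast driver so that a failing generator stops the run before amalgamation. This is a sketch under the assumption that the commands are run from the repository root; `pipeline_commands` and `run_pipeline` are hypothetical helpers, and the script names are the ones listed above.

```python
import subprocess
import sys

# Generator scripts, in the order given in the pipeline above.
GENERATORS = [
    'generate_c_api.py',
    'generate_enum.py',
    'generate_serialization.py',
    'generate_grammar.py',
    'generate_functions.py',
    'generate_settings.py',
    'generate_metrics.py',
]

def pipeline_commands(extended=True, python=sys.executable):
    """Return the pipeline's commands in execution order."""
    commands = [[python, f'scripts/{name}'] for name in GENERATORS]
    amalgamate = [python, 'scripts/amalgamation.py']
    if extended:
        amalgamate.append('--extended')
    commands.append(amalgamate)
    # Final step: verify that the amalgamated source compiles.
    commands.append(['g++', '-std=c++17', '-O2', '-c',
                     'src/amalgamation/duckdb.cpp'])
    return commands

def run_pipeline(run=subprocess.run):
    """Run each step in order; check=True aborts at the first failure."""
    for cmd in pipeline_commands():
        run(cmd, check=True)
```

The `run` parameter exists only so the driver can be exercised without invoking the real scripts, for example by passing a function that records the commands it receives.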