Implementation:Duckdb Duckdb Amalgamation Py

From Leeroopedia


Overview

A concrete tool for creating single-file amalgamated DuckDB source distributions. The amalgamation.py script reads all DuckDB source and header files, resolves include dependencies, deduplicates headers, and writes one duckdb.cpp and duckdb.hpp pair that can be compiled without the rest of the source tree.

Code Reference

Source Location
scripts/amalgamation.py (lines 1-608)

Key Functions

| Function | Signature | Purpose |
| --- | --- | --- |
| generate_amalgamation | generate_amalgamation(source_file, header_file) | Main entry point: produces a single amalgamated source and header file pair. |
| generate_amalgamation_splits | generate_amalgamation_splits(source_file, header_file, nsplits) | Produces N split source files plus a single header, for parallel compilation. |
| get_includes | get_includes(fpath, text) | Parses a file's text to extract all #include directives, distinguishing project-internal from system includes. |
| write_file | write_file(current_file) | Writes a single source file into the amalgamation output, recursively resolving its includes. |
| need_to_write_file | need_to_write_file(current_file) | Checks whether a file has already been written to the amalgamation (for deduplication). Returns True if the file still needs to be included. |
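The roles of get_includes, need_to_write_file, and write_file can be illustrated with a simplified model. The sketch below is not the script's actual code: the names get_includes_sketch and amalgamate_sketch are hypothetical, files are held in an in-memory dict instead of being read from disk, and only quoted includes are treated as project-internal.

```python
import re

# Matches quoted (project-internal) includes such as: #include "a.hpp"
# Angle-bracket includes (<vector>) are treated as system includes and left alone.
INCLUDE_RE = re.compile(r'^[ \t]*#[ \t]*include[ \t]+"([^"]+)"', re.MULTILINE)


def get_includes_sketch(text):
    """Return the quoted include paths in order of appearance."""
    return INCLUDE_RE.findall(text)


def amalgamate_sketch(files, roots):
    """Concatenate files so that each one is emitted once, after its includes.

    files: dict mapping a path to its text (an assumption standing in for disk I/O).
    roots: paths to start from, for example the .cpp files.
    """
    written = set()  # plays the role of the deduplication check
    output = []

    def write_file(path):
        if path in written:
            return
        written.add(path)
        text = files[path]
        # Emit dependencies first so declarations precede their use.
        for inc in get_includes_sketch(text):
            if inc in files:
                write_file(inc)
        # Drop include lines that were resolved; keep everything else.
        body = INCLUDE_RE.sub(
            lambda m: "" if m.group(1) in files else m.group(0), text
        )
        output.append(f"// ---- {path} ----\n{body.strip()}\n")

    for root in roots:
        write_file(root)
    return "\n".join(output)
```

Calling amalgamate_sketch with a dict in which main.cpp includes both a.hpp and b.hpp, and b.hpp includes a.hpp, emits a.hpp exactly once and before b.hpp.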

Source Discovery

The script discovers source files by parsing src/CMakeLists.txt recursively. It follows add_subdirectory() directives and collects all .cpp files listed in CMake source lists. Header files are discovered transitively through #include resolution.
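The recursive discovery described above can be sketched as follows. This is an illustration under assumptions, not the script's own parser: discover_sources_sketch is a hypothetical name, and the regular expressions only handle the simple case of literal .cpp names and literal add_subdirectory() arguments (no CMake variables or generator expressions).

```python
import os
import re

# Assumed patterns: a literal sub-directory name and literal .cpp file names.
SUBDIR_RE = re.compile(r"add_subdirectory\(\s*([^\s)]+)\s*\)")
CPP_RE = re.compile(r"([\w./-]+\.cpp)\b")


def discover_sources_sketch(directory):
    """Collect .cpp paths named in directory/CMakeLists.txt and its sub-directories."""
    cmake_path = os.path.join(directory, "CMakeLists.txt")
    if not os.path.isfile(cmake_path):
        return []
    with open(cmake_path, encoding="utf-8") as f:
        text = f.read()
    # Strip CMake comments so commented-out sources are ignored.
    text = re.sub(r"#.*", "", text)
    sources = [
        os.path.normpath(os.path.join(directory, name))
        for name in CPP_RE.findall(text)
    ]
    # Follow add_subdirectory() directives depth-first.
    for sub in SUBDIR_RE.findall(text):
        sources.extend(discover_sources_sketch(os.path.join(directory, sub)))
    return sources
```

Header files are deliberately absent from this walk; as the paragraph above notes, they are found later through #include resolution.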

I/O Contract

Command-Line Interface

python3 scripts/amalgamation.py [OPTIONS]

Options:
  --extended            Include extended modules (parquet, jemalloc)
  --header-only         Generate header-only amalgamation
  --splits N            Split source into N files for parallel compilation
  --linenumbers         Include #line directives for debugging (default)
  --no-linenumbers      Omit #line directives
  --header PATH         Override output header file path
  --source PATH         Override output source file path
  --list-sources        Print list of source files and exit
  --list-objects        Print list of object files and exit
  --includes            Print list of include files and exit
  --include-directories Print list of include directories and exit

External Dependencies

| Dependency | Purpose |
| --- | --- |
| python3 (3.7+) | Script runtime |
| os, re, sys (stdlib) | File operations, regex parsing, argument handling |

No third-party Python packages are required.

Inputs

  • Source files: all src/**/*.cpp files listed in CMakeLists.txt
  • Header files: all src/include/**/*.hpp files reachable through #include resolution
  • Third-party headers: headers from third_party/ referenced by DuckDB source
  • CMakeLists.txt: src/CMakeLists.txt and all sub-directory CMakeLists.txt files (for source discovery)

Outputs

| Output File | Description |
| --- | --- |
| src/amalgamation/duckdb.cpp | Single amalgamated C++ source file containing all implementation code |
| src/amalgamation/duckdb.hpp | Single amalgamated C++ header file containing all declarations |

When --splits N is used, the source output becomes duckdb-0.cpp, duckdb-1.cpp, ..., duckdb-(N-1).cpp.
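The naming pattern for split outputs can be sketched as below. Both helper names are hypothetical, and the contiguous, near-equal partitioning is an assumption made for illustration; the script may balance its splits differently.

```python
import os


def split_names_sketch(source_file, nsplits):
    """Derive duckdb-0.cpp ... duckdb-(N-1).cpp style names from the base path."""
    base, ext = os.path.splitext(source_file)
    return [f"{base}-{i}{ext}" for i in range(nsplits)]


def partition_sketch(items, nsplits):
    """Split items into nsplits contiguous chunks of near-equal length (assumed policy)."""
    size, extra = divmod(len(items), nsplits)
    chunks, start = [], 0
    for i in range(nsplits):
        # The first `extra` chunks take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

For example, split_names_sketch("src/amalgamation/duckdb.cpp", 4) yields the four file names used in the parallel compilation example.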

When --extended is used, additional modules (Parquet extension, jemalloc) are included in the amalgamation.

Usage Examples

Basic Amalgamation

# Produce duckdb.cpp and duckdb.hpp in src/amalgamation/
python3 scripts/amalgamation.py

Extended Amalgamation (with Parquet and jemalloc)

# Include parquet reader/writer and jemalloc in the amalgamation
python3 scripts/amalgamation.py --extended

Header-Only Amalgamation

# Produce a header-only amalgamation (all code in the header)
python3 scripts/amalgamation.py --header-only

Split Amalgamation for Parallel Compilation

# Split into 4 source files for parallel compilation
python3 scripts/amalgamation.py --splits 4

# Then compile in parallel:
g++ -std=c++17 -O2 -c src/amalgamation/duckdb-0.cpp &
g++ -std=c++17 -O2 -c src/amalgamation/duckdb-1.cpp &
g++ -std=c++17 -O2 -c src/amalgamation/duckdb-2.cpp &
g++ -std=c++17 -O2 -c src/amalgamation/duckdb-3.cpp &
wait
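# The amalgamation is library code without a main(), so link the objects
# with your own application object (main.o and my_app are placeholders):
g++ -o my_app main.o duckdb-0.o duckdb-1.o duckdb-2.o duckdb-3.o -lpthread -ldl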
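The per-file compile commands above can also be generated for any number of splits rather than written out by hand. The following is a minimal sketch with hypothetical helper names; the compiler, flags, and directory simply mirror the example and are passed as parameters so they can be changed.

```python
import subprocess


def compile_commands_sketch(nsplits, src_dir="src/amalgamation",
                            cxx="g++", flags=("-std=c++17", "-O2")):
    """Build one compile command (as an argument list) per split source file."""
    return [
        [cxx, *flags, "-c", f"{src_dir}/duckdb-{i}.cpp", "-o", f"duckdb-{i}.o"]
        for i in range(nsplits)
    ]


def compile_all_sketch(nsplits):
    """Start all compilations at once, then wait and report any failures."""
    procs = [subprocess.Popen(cmd) for cmd in compile_commands_sketch(nsplits)]
    failed = [p.args for p in procs if p.wait() != 0]
    if failed:
        raise RuntimeError(f"{len(failed)} compilation(s) failed: {failed}")
```

Launching every process at once matches the shell version's use of & and wait; for large N a bounded worker pool would avoid oversubscribing the machine.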

Custom Output Paths

# Write amalgamation to custom paths
python3 scripts/amalgamation.py --header /tmp/duckdb.hpp --source /tmp/duckdb.cpp

Listing Source Files (Diagnostic)

# List all source files that would be included in the amalgamation
python3 scripts/amalgamation.py --list-sources

# List all include directories
python3 scripts/amalgamation.py --include-directories

Full Pipeline Integration

# Run generators first, then amalgamate
python3 scripts/generate_c_api.py
python3 scripts/generate_enum.py
python3 scripts/generate_serialization.py
python3 scripts/generate_grammar.py
python3 scripts/generate_functions.py
python3 scripts/generate_settings.py
python3 scripts/generate_metrics.py

# Now amalgamate
python3 scripts/amalgamation.py --extended

# Verify compilation
g++ -std=c++17 -O2 -c src/amalgamation/duckdb.cpp
