Principle:Duckdb Duckdb Prerequisite Code Generation
Overview
This principle ensures that all generated source files are up to date before amalgamation or packaging. DuckDB relies on a suite of Python-based code generation scripts that produce C and C++ source files from JSON specifications, grammar fragments, enum headers, and template files. These generated files must be current before any amalgamation or source packaging step can proceed.
Description
The DuckDB build process depends on multiple code generation scripts, each responsible for producing specific categories of source files:
| Generator Script | Purpose | Output Category |
|---|---|---|
| scripts/generate_c_api.py | C API header generation | Public C API headers |
| scripts/generate_enum.py | Enum class generation | Enum definitions from JSON specs |
| scripts/generate_serialization.py | Serialization/deserialization code | Serialization routines |
| scripts/generate_grammar.py | SQL grammar production rules | Parser grammar (bison/flex) |
| scripts/generate_functions.py | Built-in function registration | Function catalog entries |
| scripts/generate_settings.py | Configuration settings code | Settings registration |
| scripts/generate_metrics.py | Profiling metrics definitions | Metrics enumeration and helpers |
All of these scripts must be executed successfully before the amalgamation script (scripts/amalgamation.py) or the package build script (scripts/package_build.py) can produce correct output. If any generator is skipped or fails, the resulting amalgamated source will be missing generated code, leading to compilation failures downstream.
The generation step enforces a strict prerequisite ordering: generation runs first, amalgamation runs second, and packaging runs third. This ordering is encoded in CI workflows and Makefile targets.
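The three-stage ordering described above can be written down as a small dependency graph. The following is a minimal sketch using Python's standard `graphlib` module; the stage names and the `STAGE_PREREQUISITES` mapping are illustrative labels chosen here, not identifiers taken from DuckDB's CI workflows or Makefile.

```python
from graphlib import TopologicalSorter

# Illustrative stage names (assumption): each stage maps to the set of
# stages that must finish before it may start.
STAGE_PREREQUISITES = {
    "generation": set(),
    "amalgamation": {"generation"},
    "packaging": {"amalgamation"},
}


def stage_order(prerequisites):
    """Return the stages in an order that respects every prerequisite."""
    return list(TopologicalSorter(prerequisites).static_order())


print(stage_order(STAGE_PREREQUISITES))
# ['generation', 'amalgamation', 'packaging']
```

Because the graph is a simple chain, only one valid order exists. Declaring the prerequisites as data rather than as a hard-coded sequence means that an accidental cycle would surface as a `graphlib.CycleError` instead of a silently wrong build order.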
Usage
This principle applies in the following scenarios:
- Before creating an amalgamated source file -- the amalgamation script reads from src/ and src/include/, which contain generated files. These must be fresh.
- Before building a source package -- the package build script calls amalgamation internally, so generators must have run.
- As the first step in the packaging pipeline -- CI workflows (e.g., .github/workflows/) invoke all generators before any packaging step.
- During local development -- developers modifying JSON specs, grammar files, or function definitions must re-run the relevant generators before building.
# Typical invocation order in CI or local builds:
python3 scripts/generate_c_api.py
python3 scripts/generate_enum.py
python3 scripts/generate_serialization.py
python3 scripts/generate_grammar.py
python3 scripts/generate_functions.py
python3 scripts/generate_settings.py
python3 scripts/generate_metrics.py
# Only after all generators succeed:
python3 scripts/amalgamation.py
python3 scripts/package_build.py
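The same sequence can be driven from a small Python wrapper that stops at the first failing step, so that amalgamation and packaging never run against partially generated sources. This is a sketch under assumptions: it invokes each script with no arguments, exactly as in the listing above, and the names `GENERATORS`, `DOWNSTREAM`, and `run_pipeline` are hypothetical helpers, not part of the DuckDB repository.

```python
import subprocess
import sys

# Script paths copied from the invocation order above.
GENERATORS = [
    "scripts/generate_c_api.py",
    "scripts/generate_enum.py",
    "scripts/generate_serialization.py",
    "scripts/generate_grammar.py",
    "scripts/generate_functions.py",
    "scripts/generate_settings.py",
    "scripts/generate_metrics.py",
]
DOWNSTREAM = [
    "scripts/amalgamation.py",
    "scripts/package_build.py",
]


def run_pipeline(scripts, run=subprocess.run):
    """Run each script in order and stop at the first failure.

    Returns the list of scripts that completed successfully. The `run`
    parameter exists only so the function can be exercised without
    spawning real processes.
    """
    completed = []
    for script in scripts:
        result = run([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}",
                  file=sys.stderr)
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline(GENERATORS + DOWNSTREAM)` from the repository root would run the generators first and reach the amalgamation and packaging scripts only if every generator exits with status 0.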
Theoretical Basis
This principle is rooted in two foundational concepts:
- Build Prerequisite Ordering: In any build system, tasks that produce inputs for downstream tasks must complete before those downstream tasks begin. Code generation produces .cpp and .hpp files that amalgamation reads; therefore, generation is a strict prerequisite of amalgamation.
- Dependency-Driven Generation: Each generator script reads from a well-defined set of input files (JSON specs, grammar fragments, templates) and writes to a well-defined set of output files. This makes the dependency graph explicit and deterministic. A change to any input file necessitates re-running the corresponding generator to bring outputs up-to-date.
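One common way to decide whether a generator needs to be re-run is a make-style timestamp comparison between its inputs and outputs. The sketch below illustrates that idea only; it is not how DuckDB's scripts or CI detect staleness, and `is_stale` is a hypothetical helper. Which files count as inputs and outputs for a given generator would have to be supplied by the caller.

```python
from pathlib import Path


def is_stale(inputs, outputs):
    """Return True if any output is missing or older than the newest input.

    `inputs` and `outputs` are iterables of file paths. Input files are
    expected to exist; a missing output always counts as stale.
    """
    input_paths = [Path(p) for p in inputs]
    output_paths = [Path(p) for p in outputs]

    # No declared outputs, or an output that was never generated.
    if not output_paths or any(not p.exists() for p in output_paths):
        return True

    newest_input = max((p.stat().st_mtime for p in input_paths), default=0.0)
    oldest_output = min(p.stat().st_mtime for p in output_paths)
    return newest_input > oldest_output
```

A caller could pair each generator with its input and output paths and re-run only those for which `is_stale` returns `True`. Timestamp checks are a heuristic: they can miss changes after a checkout that resets modification times, so re-running every generator remains the safe default.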