Principle:Duckdb Duckdb Prerequisite Code Generation

Overview

This principle ensures that all generated source files are up to date before packaging or amalgamation. DuckDB relies on a suite of Python-based code generation scripts that produce C and C++ source files from JSON specifications, grammar fragments, enum headers, and template files. These generated files must be current before any amalgamation or source packaging step can proceed.

Description

The DuckDB build process depends on multiple code generation scripts, each responsible for producing specific categories of source files:

Generator Script	Purpose	Output Category
`scripts/generate_c_api.py`	C API header generation	Public C API headers
`scripts/generate_enum.py`	Enum class generation	Enum definitions from JSON specs
`scripts/generate_serialization.py`	Serialization/deserialization code	Serialization routines
`scripts/generate_grammar.py`	SQL grammar production rules	Parser grammar (bison/flex)
`scripts/generate_functions.py`	Built-in function registration	Function catalog entries
`scripts/generate_settings.py`	Configuration settings code	Settings registration
`scripts/generate_metrics.py`	Profiling metrics definitions	Metrics enumeration and helpers

All of these scripts must be executed successfully before the amalgamation script (scripts/amalgamation.py) or the package build script (scripts/package_build.py) can produce correct output. If any generator is skipped or fails, the resulting amalgamated source will be missing generated code, leading to compilation failures downstream.

The pipeline therefore follows a strict prerequisite ordering: generation runs first, amalgamation second, and packaging third. This ordering is encoded in CI workflows and Makefile targets.

Usage

This principle applies in the following scenarios:

Before creating an amalgamated source file -- the amalgamation script reads from src/ and src/include/, which contain generated files. These must be fresh.
Before building a source package -- the package build script calls amalgamation internally, so generators must have run.
As the first step in the packaging pipeline -- CI workflows (e.g., those under .github/workflows/) invoke all generators before any packaging step.
During local development -- developers modifying JSON specs, grammar files, or function definitions must re-run the relevant generators before building.

# Typical invocation order in CI or local builds:
python3 scripts/generate_c_api.py
python3 scripts/generate_enum.py
python3 scripts/generate_serialization.py
python3 scripts/generate_grammar.py
python3 scripts/generate_functions.py
python3 scripts/generate_settings.py
python3 scripts/generate_metrics.py

# Only after all generators succeed:
python3 scripts/amalgamation.py
python3 scripts/package_build.py
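
The same ordering can be sketched as a small fail-fast driver. This is an illustrative sketch, not a script that ships with DuckDB: `run_steps` and `build_pipeline` are hypothetical helper names, and the script paths are simply the ones listed above, assumed to be run from the repository root without extra arguments.

```python
import subprocess
import sys

# Script paths as listed in this page; assumed relative to the repository root.
GENERATORS = [
    "scripts/generate_c_api.py",
    "scripts/generate_enum.py",
    "scripts/generate_serialization.py",
    "scripts/generate_grammar.py",
    "scripts/generate_functions.py",
    "scripts/generate_settings.py",
    "scripts/generate_metrics.py",
]
PACKAGING = ["scripts/amalgamation.py", "scripts/package_build.py"]


def build_pipeline(python=sys.executable):
    """Return the full command list: all generators first, then packaging."""
    return [[python, script] for script in GENERATORS + PACKAGING]


def run_steps(steps):
    """Run each command in order and stop at the first non-zero exit."""
    completed = []
    for cmd in steps:
        # check=True raises CalledProcessError on failure, so no later
        # step (in particular amalgamation or packaging) is started.
        subprocess.run(cmd, check=True)
        completed.append(cmd)
    return completed

# Hypothetical usage from the repository root:
#     run_steps(build_pipeline())
```

Because `subprocess.run(..., check=True)` raises on the first failing command, amalgamation and packaging are never reached if a generator fails, which is the behavior the "only after all generators succeed" comment describes.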

Theoretical Basis

This principle is rooted in two foundational concepts:

Build Prerequisite Ordering: In any build system, tasks that produce inputs for downstream tasks must complete before those downstream tasks begin. Code generation produces .cpp and .hpp files that amalgamation reads; therefore, generation is a strict prerequisite of amalgamation.
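
As a minimal illustration of prerequisite ordering, the three stages can be modeled as a dependency graph and sorted topologically. The stage names and the `stage_order` helper are assumptions made for this sketch; DuckDB's build does not necessarily represent the stages this way.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the set of stages that must finish before it starts.
PREREQUISITES = {
    "generation": set(),
    "amalgamation": {"generation"},
    "packaging": {"amalgamation"},
}


def stage_order(prerequisites):
    """Return the stages in an order that respects every prerequisite."""
    return list(TopologicalSorter(prerequisites).static_order())


print(stage_order(PREREQUISITES))
```

For this linear chain there is only one valid order, so the sort always places generation before amalgamation and amalgamation before packaging.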

Dependency-Driven Generation: Each generator script reads from a well-defined set of input files (JSON specs, grammar fragments, templates) and writes to a well-defined set of output files. This makes the dependency graph explicit and deterministic. A change to any input file requires re-running the corresponding generator to bring its outputs up to date.
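
One way to make "a change to any input requires re-running the generator" concrete is a timestamp-based staleness check, in the style of Make. The sketch below is an assumption for illustration only: `is_stale` is a hypothetical helper, and the source does not say that DuckDB's scripts or CI detect staleness this way.

```python
from pathlib import Path


def is_stale(inputs, outputs):
    """Return True if any output is missing or older than the newest input."""
    inputs = [Path(p) for p in inputs]
    outputs = [Path(p) for p in outputs]
    # No declared outputs, or a missing output, always counts as stale.
    if not outputs or any(not p.exists() for p in outputs):
        return True
    # Compare modification times: outputs must be at least as new as inputs.
    newest_input = max((p.stat().st_mtime for p in inputs), default=0.0)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    return newest_input > oldest_output
```

A generator would be re-run whenever `is_stale` returns True for its input and output sets. Modification times are only a heuristic; comparing content hashes is a stricter alternative when timestamps are unreliable, for example after a fresh checkout.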
