Workflow: DuckDB Code Generation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Database_Engineering, Code_Generation, Meta_Programming |
| Last Updated | 2026-02-07 11:00 GMT |
Overview
End-to-end process for running DuckDB's code generation pipeline, which uses Python scripts to produce C/C++ source files from JSON specifications, grammar definitions, and schema descriptors.
Description
This workflow covers DuckDB's extensive meta-programming system that generates C++ code from declarative specifications. Over 15 Python scripts produce code for the C API headers, enum string conversions, serialization/deserialization routines, SQL grammar parser, function registrations, settings, storage versioning, profiling metrics, and PEG grammar transformers. This approach reduces boilerplate, ensures consistency across the codebase, and makes it straightforward to add new types, functions, or settings without manually writing repetitive code.
Usage
Execute this workflow when adding new SQL functions, creating new enum types, modifying the SQL grammar, adding C API functions, changing serialization formats, adding new configuration settings, or updating storage version compatibility. Also required after modifying any JSON specification file or grammar definition that feeds into the code generation pipeline.
Execution Steps
Step 1: Generate C API Headers
Parse JSON definition files in the header_generation directory and produce the C API header files: duckdb.h (main C header for linking), duckdb_extension.h (extension development header), duckdb_go_extension.h (Go extension header), and extension_api.hpp (internal extension API). The generator resolves function groups, parameter types, versioning, and documentation comments.
Key considerations:
- Definition files are in src/include/duckdb/main/capi/header_generation/
- Function definitions are in JSON files organized by functional group
- The extension API struct maps function pointers for runtime extension loading
- Version tagging controls which functions are available in each API version
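To make the header-generation step concrete, here is a minimal sketch of a generator that turns a JSON function definition into C prototypes. The JSON shape, the `DUCKDB_C_API` macro, and the `duckdb_open` entry are illustrative assumptions, not the actual definition format used under header_generation/.

```python
import json

# Hypothetical minimal shape of a C API function-group definition;
# the real JSON files carry versioning and documentation comments too.
GROUP_JSON = """
{
  "group": "open_connect",
  "entries": [
    {
      "name": "duckdb_open",
      "return_type": "duckdb_state",
      "params": [
        {"type": "const char *", "name": "path"},
        {"type": "duckdb_database *", "name": "out_database"}
      ]
    }
  ]
}
"""

def emit_prototype(entry):
    """Render one JSON entry as a C function prototype."""
    params = ", ".join(
        f"{p['type']}{'' if p['type'].endswith('*') else ' '}{p['name']}"
        for p in entry["params"])
    return f"DUCKDB_C_API {entry['return_type']} {entry['name']}({params});"

def generate_header(spec_text):
    """Emit a header fragment for one function group."""
    spec = json.loads(spec_text)
    lines = [f"// Function group: {spec['group']}"]
    lines += [emit_prototype(e) for e in spec["entries"]]
    return "\n".join(lines)

print(generate_header(GROUP_JSON))
```

The real generator additionally resolves which API version each entry belongs to and fills the extension API struct with matching function-pointer slots.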
Step 2: Generate Enum Utilities
Produce bidirectional string-to-enum and enum-to-string conversion functions for all C++ enum classes in DuckDB. Two scripts work in tandem: generate_enums.py creates the enum class definitions from JSON specifications, and generate_enum_util.py creates the EnumUtil conversion functions by scanning enum headers.
Key considerations:
- Enum definitions live in JSON specification files
- All enum constants must have explicit integer values (verified by verify_enum_integrity.py)
- Generated EnumUtil provides FromString and ToString for each enum
- Changes to enums require re-running both generation scripts
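The shape of the EnumUtil output can be sketched as follows. The spec layout, the emitted C++ formatting, and the integrity check are simplified assumptions modeled on the description above.

```python
# Illustrative enum spec: every constant carries an explicit integer value,
# mirroring the invariant that verify_enum_integrity.py enforces.
ENUM_SPEC = {"name": "JoinType", "values": {"INNER": 0, "LEFT": 1, "RIGHT": 2}}

def verify_integrity(spec):
    """Check that every enum constant has an explicit integer value."""
    return all(isinstance(v, int) for v in spec["values"].values())

def generate_to_string(spec):
    """Emit an EnumUtil::ToString specialization for one enum (sketch)."""
    name = spec["name"]
    cases = "\n".join(
        f'\tcase {name}::{v}:\n\t\treturn "{v}";' for v in spec["values"])
    return ("template<>\n"
            f"const char *EnumUtil::ToString<{name}>({name} value) {{\n"
            f"\tswitch (value) {{\n{cases}\n"
            "\tdefault:\n"
            "\t\tthrow NotImplementedException(\"unrecognized enum value\");\n"
            "\t}\n}")

print(generate_to_string(ENUM_SPEC))
```

A matching FromString specialization would invert the mapping, string literal back to enum constant, which is why both scripts must be re-run together after any enum change.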
Step 3: Generate Serialization Code
Produce C++ serialization and deserialization methods for internal data structures including expressions, query nodes, logical operators, parsed statements, and table references. The generator reads JSON descriptors that define the fields, types, and inheritance hierarchy of each serializable class, then emits FormatSerialize and FormatDeserialize methods.
Key considerations:
- Serialization descriptors are JSON files defining class fields
- Supports inheritance hierarchies with base/derived class serialization
- Used for query plan caching, client-server communication, and storage
- Backward compatibility is critical and tested by test_serialization_bwc.py
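The descriptor-to-method translation can be illustrated with a small sketch. The descriptor fields, the property-id layout, and the emitted method body are assumptions; only the FormatSerialize naming and the base/derived chaining come from the description above.

```python
# Hypothetical serialization descriptor for one class in the hierarchy.
DESCRIPTOR = {
    "class": "BoundLimitNode",
    "base": "QueryNode",
    "members": [
        {"id": 200, "name": "limit", "type": "idx_t"},
        {"id": 201, "name": "offset", "type": "idx_t"},
    ],
}

def generate_serialize(desc):
    """Emit a FormatSerialize method body from a class descriptor (sketch)."""
    lines = [f"void {desc['class']}::FormatSerialize(FormatSerializer &serializer) const {{"]
    if desc.get("base"):
        # Derived classes serialize their base-class fields first.
        lines.append(f"\t{desc['base']}::FormatSerialize(serializer);")
    for m in desc["members"]:
        lines.append(
            f'\tserializer.WriteProperty({m["id"]}, "{m["name"]}", {m["name"]});')
    lines.append("}")
    return "\n".join(lines)

print(generate_serialize(DESCRIPTOR))
```

Because these methods end up on disk and on the wire, any field added to a descriptor must keep older readers working, which is what test_serialization_bwc.py guards.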
Step 4: Generate SQL Grammar
Assemble the complete SQL parser grammar from modular components and run the Bison parser generator to produce the C++ parser source. Additionally, generate the Flex lexical scanner from the scan.l grammar file. The modular grammar structure allows different SQL statement types to be defined separately.
Key considerations:
- Grammar components are assembled by generate_grammar.py
- Bison produces src_backend_parser_gram.cpp (33,872 lines)
- Flex (via generate_flex.py) produces src_backend_parser_scan.cpp (4,394 lines)
- The grammar extends PostgreSQL syntax with DuckDB-specific additions
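The assembly step amounts to stitching modular grammar fragments into one Bison input file. A minimal sketch, with made-up fragment names and rules, assuming a conventional three-section Bison layout:

```python
import io

# Illustrative modular grammar fragments, one per statement type.
FRAGMENTS = {
    "select.y": "SelectStmt: SELECT target_list FROM from_clause ;",
    "insert.y": "InsertStmt: INSERT INTO qualified_name VALUES values_list ;",
}

def assemble_grammar(prologue, fragments, epilogue):
    """Concatenate fragments into a single Bison grammar (sketch)."""
    buf = io.StringIO()
    buf.write(prologue + "\n%%\n")
    for name in sorted(fragments):  # deterministic output ordering
        buf.write(f"/* from {name} */\n{fragments[name]}\n")
    buf.write("%%\n" + epilogue + "\n")
    return buf.getvalue()

grammar = assemble_grammar("%token SELECT INSERT", FRAGMENTS,
                           "/* supporting C code */")
print(grammar)
# The assembled file is then handed to the parser generator, roughly:
#   bison -o src_backend_parser_gram.cpp assembled_gram.y
```

Keeping the fragments separate means a new statement type is a new file plus a re-run of the assembly, rather than an edit deep inside a 30,000-line grammar.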
Step 5: Generate Function Registrations
Produce C++ header files for scalar and aggregate function registrations from functions.json descriptor files. Also generate the extension_entries.hpp header containing static lookup tables that map extension function names, types, and settings to their providing extension.
Key considerations:
- generate_functions.py processes functions.json files in function subdirectories
- generate_extensions_function.py builds the extension function lookup tables
- Lookup tables enable autoloading of extensions when unrecognized functions are called
- Validates that no duplicate function names exist across extensions
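The lookup-table build with duplicate validation can be sketched as below. Extension names, function names, and the emitted struct layout are illustrative assumptions; the duplicate check mirrors the validation described above.

```python
# Illustrative extension -> functions mapping (not real descriptor data).
EXTENSION_FUNCTIONS = {
    "httpfs": ["s3_upload"],
    "json": ["json_extract", "json_valid"],
}

def build_lookup(ext_functions):
    """Emit a static lookup table mapping function name to extension,
    rejecting duplicate function names across extensions (sketch)."""
    seen = {}
    rows = []
    for ext, funcs in sorted(ext_functions.items()):
        for fn in funcs:
            if fn in seen:
                raise ValueError(
                    f"duplicate function {fn!r} in {seen[fn]} and {ext}")
            seen[fn] = ext
            rows.append(f'\t{{"{fn}", "{ext}"}},')
    body = "\n".join(rows)
    return ("static const ExtensionFunctionEntry EXTENSION_FUNCTIONS[] = {\n"
            f"{body}\n}};")

print(build_lookup(EXTENSION_FUNCTIONS))
```

At runtime, a miss in the catalog followed by a hit in this table is what lets DuckDB autoload the providing extension instead of failing the query.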
Step 6: Generate Settings And Storage Info
Produce C++ code for the configuration settings system and storage version information. The settings generator creates registration code for all database configuration options. The storage info generator maintains version arrays that track storage format compatibility across DuckDB releases.
Key considerations:
- generate_settings.py delegates to the settings_script module
- generate_storage_info.py reads from a centralized versions.json
- Storage version arrays enable backward/forward compatibility checks
- generate_storage_version.py and generate_plan_storage_version.py produce test databases
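The storage-version bookkeeping can be sketched as follows. The release names and version numbers are made up, not DuckDB's real storage versions, and the array layout is an assumption; the compatibility rule shown is the usual "newer binaries read older files" direction.

```python
# Illustrative stand-in for the centralized versions.json contents.
VERSIONS = [
    {"release": "v0.9.0", "storage_version": 64},
    {"release": "v0.10.0", "storage_version": 64},
    {"release": "v1.0.0", "storage_version": 65},
]

def generate_storage_array(versions):
    """Emit a release -> storage-version C++ array (sketch)."""
    rows = "\n".join(
        f'\t{{"{v["release"]}", {v["storage_version"]}}},' for v in versions)
    return ("static const StorageVersionInfo storage_version_info[] = {\n"
            f"{rows}\n\t{{nullptr, 0}}\n}};")

def is_readable(db_storage_version, current_storage_version):
    """Backward-compatibility check: a binary can read files written
    at its own storage version or any earlier one."""
    return db_storage_version <= current_storage_version

print(generate_storage_array(VERSIONS))
```

The companion test-database generators exercise exactly this table: they write databases at pinned storage versions so future releases can prove they still open them.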
Step 7: Generate Auxiliary Code
Produce remaining generated code including profiling metric enums, TPC-DS schema definitions, TPC-H/TPC-DS query constants (embedded as C++ header data), vector size definitions, and PEG grammar transformer implementations for the autocomplete extension.
Key considerations:
- generate_metric_enums.py creates the profiling metrics system
- generate_csv_header.py embeds benchmark queries as C++ constants
- generate_tpcds_schema.py creates schema metadata from PostgreSQL
- generate_vector_sizes.py defines STANDARD_VECTOR_SIZE-dependent arrays
- generate_peg_transformer.py checks coverage of PEG rule transformers
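In the same spirit as the vector-size generator, here is a hedged sketch of emitting a STANDARD_VECTOR_SIZE-dependent header. Only the constant name comes from the text; the emitted array, the `idx_t` type, and the power-of-two constraint are illustrative assumptions.

```python
def generate_vector_size_header(standard_vector_size):
    """Emit a header fragment with all power-of-two sizes up to
    STANDARD_VECTOR_SIZE (sketch; layout is assumed)."""
    if standard_vector_size & (standard_vector_size - 1):
        raise ValueError("STANDARD_VECTOR_SIZE must be a power of two")
    sizes = ", ".join(
        str(1 << s) for s in range(standard_vector_size.bit_length()))
    return (f"#define STANDARD_VECTOR_SIZE {standard_vector_size}\n"
            f"static constexpr idx_t VECTOR_SIZES[] = {{{sizes}}};")

print(generate_vector_size_header(2048))
```

Generating this at build time keeps every size-dependent array in sync with a single configurable constant instead of hand-maintained literals scattered through the engine.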