
Workflow:DuckDB Code Generation Pipeline

From Leeroopedia


Knowledge Sources
Domains Database_Engineering, Code_Generation, Meta_Programming
Last Updated 2026-02-07 11:00 GMT

Overview

End-to-end process for running DuckDB's code generation pipeline, which uses Python scripts to produce C/C++ source files from JSON specifications, grammar definitions, and schema descriptors.

Description

This workflow covers DuckDB's extensive meta-programming system that generates C++ code from declarative specifications. Over 15 Python scripts produce code for the C API headers, enum string conversions, serialization/deserialization routines, SQL grammar parser, function registrations, settings, storage versioning, profiling metrics, and PEG grammar transformers. This approach reduces boilerplate, ensures consistency across the codebase, and makes it straightforward to add new types, functions, or settings without manually writing repetitive code.

Usage

Execute this workflow when adding new SQL functions, creating new enum types, modifying the SQL grammar, adding C API functions, changing serialization formats, adding new configuration settings, or updating storage version compatibility. Also required after modifying any JSON specification file or grammar definition that feeds into the code generation pipeline.

Execution Steps

Step 1: Generate C API Headers

Parse JSON definition files in the header_generation directory and produce the C API header files: duckdb.h (main C header for linking), duckdb_extension.h (extension development header), duckdb_go_extension.h (Go extension header), and extension_api.hpp (internal extension API). The generator resolves function groups, parameter types, versioning, and documentation comments.

Key considerations:

  • Definition files are in src/include/duckdb/main/capi/header_generation/
  • Function definitions are in JSON files organized by functional group
  • The extension API struct maps function pointers for runtime extension loading
  • Version tagging controls which functions are available in each API version
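As a rough sketch of this step, the generator can be pictured as a function that turns one JSON function entry into a C prototype. The JSON shape and the `DUCKDB_C_API` macro below are simplified illustrations, not the exact schema used in `header_generation/`:

```python
import json

# Hypothetical miniature of the C API header generator: read a JSON
# function-group definition and emit C declarations for duckdb.h.
# Field names here are illustrative, modeled loosely on the real files.
SPEC = json.loads("""
{
    "group": "open_connect",
    "entries": [
        {
            "name": "duckdb_open",
            "return_type": "duckdb_state",
            "params": [
                {"type": "const char *", "name": "path"},
                {"type": "duckdb_database *", "name": "out_database"}
            ]
        }
    ]
}
""")

def emit_declaration(entry):
    """Render one JSON function entry as a C prototype string."""
    params = ", ".join(
        f"{p['type']}{p['name']}" if p["type"].endswith("*")
        else f"{p['type']} {p['name']}"
        for p in entry["params"])
    return f"DUCKDB_C_API {entry['return_type']} {entry['name']}({params});"

declarations = [emit_declaration(e) for e in SPEC["entries"]]
```

The real generator additionally handles version tags, documentation comments, and the extension API struct; this sketch only shows the core JSON-to-declaration mapping.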

Step 2: Generate Enum Utilities

Produce bidirectional string-to-enum and enum-to-string conversion functions for all C++ enum classes in DuckDB. Two scripts work in tandem: generate_enums.py creates the enum class definitions from JSON specifications, and generate_enum_util.py creates the EnumUtil conversion functions by scanning enum headers.

Key considerations:

  • Enum definitions live in JSON specification files
  • All enum constants must have explicit integer values (verified by verify_enum_integrity.py)
  • Generated EnumUtil provides FromString and ToString for each enum
  • Changes to enums require re-running both generation scripts
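The shape of the code emitted by this step can be sketched as follows. In the real pipeline the enum members are scanned from headers; here they are hard-coded, and the emitted C++ is an illustrative approximation of a generated `EnumUtil::ToString` specialization:

```python
# Hypothetical sketch of the enum-to-string generation: given an enum name
# and its members, emit a C++ switch that maps each member to its name.
def emit_to_string(enum_name, members):
    """Emit a C++ ToString switch for the given enum members."""
    lines = [f"const char *EnumUtil::ToString({enum_name} value) {{",
             "\tswitch (value) {"]
    for member in members:
        lines.append(f"\tcase {enum_name}::{member}:")
        lines.append(f'\t\treturn "{member}";')
    lines.append("\tdefault:")
    lines.append('\t\tthrow NotImplementedException("unrecognized enum value");')
    lines.append("\t}")
    lines.append("}")
    return "\n".join(lines)

code = emit_to_string("OrderType", ["ASCENDING", "DESCENDING"])
```

A matching `FromString` is generated the same way with the lookup direction reversed, which is why both scripts must be re-run together after any enum change.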

Step 3: Generate Serialization Code

Produce C++ serialization and deserialization methods for internal data structures including expressions, query nodes, logical operators, parsed statements, and table references. The generator reads JSON descriptors that define the fields, types, and inheritance hierarchy of each serializable class, then emits FormatSerialize and FormatDeserialize methods.

Key considerations:

  • Serialization descriptors are JSON files defining class fields
  • Supports inheritance hierarchies with base/derived class serialization
  • Used for query plan caching, client-server communication, and storage
  • Backward compatibility is critical and tested by test_serialization_bwc.py
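The descriptor-driven generation can be illustrated with a minimal sketch. The class name, field ids, and emitted method shape below are assumptions chosen for illustration, not copied from the real descriptors:

```python
# Hypothetical miniature of the serialization generator: a JSON descriptor
# lists a class's fields with stable property ids, and we emit a Serialize
# method that writes each field in declaration order.
DESCRIPTOR = {
    "class": "BoundLimitNode",
    "members": [
        {"id": 100, "name": "type", "type": "LimitNodeType"},
        {"id": 101, "name": "constant_integer", "type": "idx_t"},
    ],
}

def emit_serialize(desc):
    """Emit a C++ Serialize method body from a class descriptor."""
    body = [f"void {desc['class']}::Serialize(Serializer &serializer) const {{"]
    for member in desc["members"]:
        body.append(
            f'\tserializer.WriteProperty({member["id"]}, "{member["name"]}", {member["name"]});')
    body.append("}")
    return "\n".join(body)

method = emit_serialize(DESCRIPTOR)
```

Keeping the property ids explicit and stable in the descriptor is what makes backward-compatible reading possible: a newer deserializer can skip unknown ids and default missing ones.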

Step 4: Generate SQL Grammar

Assemble the complete SQL parser grammar from modular components and run the Bison parser generator to produce the C++ parser source. Additionally, generate the Flex lexical scanner from the scan.l grammar file. The modular grammar structure allows different SQL statement types to be defined separately.

Key considerations:

  • Grammar components are assembled by generate_grammar.py
  • Bison produces src_backend_parser_gram.cpp (33,872 lines)
  • Flex (via generate_flex.py) produces src_backend_parser_scan.cpp (4,394 lines)
  • The grammar extends PostgreSQL syntax with DuckDB-specific additions
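The assembly step itself is conceptually simple: concatenate the modular grammar fragments in a deterministic order before handing the result to Bison. The fragment names and rule text below are illustrative, not the real grammar files:

```python
# Hypothetical sketch of the grammar assembly performed by
# generate_grammar.py: stitch per-statement .y fragments into a single
# Bison input, separated by marker comments for traceability.
def assemble_grammar(fragments):
    """Concatenate grammar fragments in sorted (deterministic) order."""
    parts = []
    for name in sorted(fragments):
        parts.append(f"/* ===== {name} ===== */")
        parts.append(fragments[name].strip())
    return "\n".join(parts) + "\n"

fragments = {
    "select.y": "SelectStmt: select_no_parens | select_with_parens ;",
    "delete.y": "DeleteStmt: DELETE_P FROM relation_expr where_clause ;",
}
grammar = assemble_grammar(fragments)
```

Deterministic ordering matters here: if fragment order varied between runs, the generated parser source would churn in version control even when no rule changed.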

Step 5: Generate Function Registrations

Produce C++ header files for scalar and aggregate function registrations from functions.json descriptor files. Also generate the extension_entries.hpp header containing static lookup tables that map extension function names, types, and settings to their providing extension.

Key considerations:

  • generate_functions.py processes functions.json files in function subdirectories
  • generate_extensions_function.py builds the extension function lookup tables
  • Lookup tables enable autoloading of extensions when unrecognized functions are called
  • Validates that no duplicate function names exist across extensions
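The duplicate-name validation can be sketched as a simple table build with conflict detection. The function and extension names below are examples, not drawn from the real `extension_entries.hpp`:

```python
# Hypothetical sketch of the extension lookup-table build: collect
# (function, extension) pairs and flag any function claimed by two
# different extensions, since autoloading needs an unambiguous owner.
def build_lookup(entries):
    """Map function names to their providing extension, reporting clashes."""
    table = {}
    duplicates = []
    for func_name, ext_name in entries:
        if func_name in table and table[func_name] != ext_name:
            duplicates.append(func_name)
        else:
            table[func_name] = ext_name
    return table, duplicates

entries = [
    ("read_parquet", "parquet"),
    ("st_area", "spatial"),
    ("read_parquet", "parquet"),   # repeat within one extension: allowed
    ("st_area", "another_ext"),    # claimed by a second extension: flagged
]
table, duplicates = build_lookup(entries)
```

An ambiguous entry would make autoloading nondeterministic (which extension should be loaded when `st_area` is first called?), which is why the real generator treats duplicates as a hard error.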

Step 6: Generate Settings And Storage Info

Produce C++ code for the configuration settings system and storage version information. The settings generator creates registration code for all database configuration options. The storage info generator maintains version arrays that track storage format compatibility across DuckDB releases.

Key considerations:

  • generate_settings.py delegates to the settings_script module
  • generate_storage_info.py reads from a centralized versions.json
  • Storage version arrays enable backward/forward compatibility checks
  • generate_storage_version.py and generate_plan_storage_version.py produce test databases
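The storage-info generation can be pictured as reading the centralized version mapping and emitting a C++ array. The version numbers and struct name below are illustrative placeholders, not the real contents of versions.json:

```python
import json

# Hypothetical miniature of generate_storage_info.py: read a versions
# mapping and emit a null-terminated C++ array of (release, storage
# version) pairs used for compatibility checks.
VERSIONS = json.loads("""
{
    "storage": [
        {"name": "v0.10.0", "storage_version": 64},
        {"name": "v1.0.0",  "storage_version": 64},
        {"name": "v1.2.0",  "storage_version": 65}
    ]
}
""")

def emit_storage_array(versions):
    """Emit a C++ array literal mapping release names to storage versions."""
    rows = [f'\t{{"{v["name"]}", {v["storage_version"]}}},'
            for v in versions["storage"]]
    return ("static const StorageVersionInfo storage_version_info[] = {\n"
            + "\n".join(rows)
            + "\n\t{nullptr, 0}};")

array_code = emit_storage_array(VERSIONS)
```

Because several releases can share one storage version (as v0.10.0 and v1.0.0 do above), the array lets the engine answer both "which releases can read this file?" and "which storage version should I write for a target release?".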

Step 7: Generate Auxiliary Code

Produce remaining generated code including profiling metric enums, TPC-DS schema definitions, TPC-H/TPC-DS query constants (embedded as C++ header data), vector size definitions, and PEG grammar transformer implementations for the autocomplete extension.

Key considerations:

  • generate_metric_enums.py creates the profiling metrics system
  • generate_csv_header.py embeds benchmark queries as C++ constants
  • generate_tpcds_schema.py creates schema metadata from PostgreSQL
  • generate_vector_sizes.py defines STANDARD_VECTOR_SIZE-dependent arrays
  • generate_peg_transformer.py checks coverage of PEG rule transformers

Execution Diagram

GitHub URL

Workflow Repository