Principle:Duckdb Duckdb Auxiliary Code Generation
Overview
This principle covers the generation of auxiliary code: profiling metric enums, embedded test data, TPC-DS schema metadata, and compile-time constants. DuckDB uses a collection of small code generation scripts to produce supporting infrastructure that falls outside the larger function registration and settings generation categories.
Description
The Auxiliary Code Generation principle governs a collection of smaller code generation tasks that produce various supporting artifacts for the DuckDB engine:
- Profiling metric enums -- A generator reads a JSON specification of profiling metric types and produces C++ enum definitions along with helper functions for converting between metric names and enum values. This ensures the profiling infrastructure has a consistent, auto-generated set of metric identifiers.
- Embedded test data -- Generators read SQL query files (TPC-H and TPC-DS workloads) and embed them as C++ string constants. This allows the test and benchmark infrastructure to reference standard queries without runtime file I/O.
- TPC-DS schema metadata -- A generator produces C++ headers describing the TPC-DS schema (table names, column definitions) for use in the TPC-DS benchmark harness.
- Vector size constants -- A generator produces a C++ header defining compile-time constants for the DuckDB vector size, which governs the batch size used throughout the vectorized execution engine.
Each of these tasks follows the same overarching pattern: a declarative or file-based input is transformed by a Python script into C++ code that is compiled into the engine.
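The pattern can be illustrated with a minimal sketch of the profiling metric case. This is not DuckDB's actual generator: the JSON layout, the enum name `ExampleMetricType`, and the helper name `generate_metric_enum` are all assumptions made for illustration.

```python
import json

# Hypothetical JSON specification; the schema of the real metric file may differ.
SPEC = """
{"metrics": ["CPU_TIME", "LATENCY", "ROWS_RETURNED"]}
"""


def generate_metric_enum(spec_text: str, enum_name: str = "ExampleMetricType") -> str:
    """Turn a JSON list of metric names into a C++ enum plus a to-string helper."""
    metrics = json.loads(spec_text)["metrics"]

    lines = [
        "// This file is generated. Do not edit by hand.",
        "#pragma once",
        "",
        "#include <cstdint>",
        "",
        f"enum class {enum_name} : uint8_t {{",
    ]
    # Each metric gets an explicit, stable discriminant derived from its position.
    for index, name in enumerate(metrics):
        lines.append(f"\t{name} = {index},")
    lines.append("};")
    lines.append("")

    # Helper that maps an enum value back to its metric name.
    lines.append(f"inline const char *{enum_name}ToString({enum_name} value) {{")
    lines.append("\tswitch (value) {")
    for name in metrics:
        lines.append(f"\tcase {enum_name}::{name}:")
        lines.append(f'\t\treturn "{name}";')
    lines.append("\t}")
    lines.append('\treturn "UNKNOWN";')
    lines.append("}")
    return "\n".join(lines) + "\n"


header = generate_metric_enum(SPEC)
print(header)
```

Because both the enum and its string conversion are emitted from the same list, adding a metric to the JSON input updates both in one step.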
Usage
Apply this principle when modifying profiling metrics, updating test queries, or changing vector size constants:
- When a new profiling metric is introduced, update the metric type JSON and re-run the metric enum generator.
- When TPC-H or TPC-DS queries are updated, re-run the CSV header and TPC-DS generators to refresh the embedded C++ strings.
- When the default vector size needs to change (e.g., for performance experiments), update the value supplied to the vector size generator and re-run it.
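As a rough illustration of the vector size step, the sketch below emits a header that defines a single constant. The macro name `EXAMPLE_VECTOR_SIZE`, the function name, and the power-of-two check are assumptions for this example and are not taken from DuckDB's actual script.

```python
def generate_vector_size_header(vector_size: int, macro: str = "EXAMPLE_VECTOR_SIZE") -> str:
    """Emit a C++ header that defines one compile-time vector size constant."""
    # Assumption for this sketch: only positive powers of two are accepted.
    if vector_size <= 0 or (vector_size & (vector_size - 1)) != 0:
        raise ValueError(f"vector size must be a positive power of two, got {vector_size}")

    return (
        "// This file is generated. Do not edit by hand.\n"
        "#pragma once\n"
        "\n"
        # The #ifndef guard lets a build flag override the generated default.
        f"#ifndef {macro}\n"
        f"#define {macro} {vector_size}\n"
        "#endif\n"
    )


print(generate_vector_size_header(2048))
```

Validating the value in the generator means an invalid size fails at generation time rather than surfacing later as a compile error or a runtime bug.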
Theoretical Basis
- Compile-time code embedding -- SQL queries and other text artifacts are embedded as C++ string literals at compile time, eliminating runtime file dependencies and ensuring reproducible benchmarks.
- Constant generation -- Values such as vector sizes and enum discriminants are generated from a single source of truth, preventing accidental inconsistencies when the same constant is referenced in multiple places.
- Test data embedding -- Standard benchmark queries (TPC-H, TPC-DS) are baked into the binary, allowing benchmark and test harnesses to run without external data files.
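The embedding idea can be sketched as follows, assuming the queries are stored as one `.sql` file per query. The function names, the array name `EXAMPLE_QUERIES`, and the use of C++ raw string literals are illustrative choices, not a description of how DuckDB's generators are written.

```python
from pathlib import Path
from typing import Dict


def load_queries(directory: str) -> Dict[str, str]:
    """Read every .sql file in a directory, keyed by file name without extension."""
    return {path.stem: path.read_text() for path in sorted(Path(directory).glob("*.sql"))}


def embed_queries(queries: Dict[str, str], array_name: str = "EXAMPLE_QUERIES") -> str:
    """Emit a C++ header holding each query as a raw string literal."""
    delimiter = "SQL"
    terminator = ")" + delimiter + '"'

    lines = [
        "// This file is generated. Do not edit by hand.",
        "#pragma once",
        "",
        f"static const char *{array_name}[] = {{",
    ]
    # Sort by name so the generated header is stable across runs.
    for name in sorted(queries):
        sql = queries[name].strip()
        # A raw string literal ends at the first terminator, so reject input containing it.
        if terminator in sql:
            raise ValueError(f"query {name} contains the raw string terminator")
        lines.append(f"\t// {name}")
        lines.append(f'\tR"{delimiter}({sql}){delimiter}",')
    lines.append("};")
    lines.append(f"static const int {array_name}_COUNT = {len(queries)};")
    return "\n".join(lines) + "\n"


# In-memory example; load_queries("path/to/queries") would read real files instead.
example = {"q01": "SELECT 1;", "q02": "SELECT count(*) FROM lineitem;"}
embedded = embed_queries(example)
print(embedded)
```

Raw string literals avoid having to escape quotes and newlines in the SQL text. An alternative is to emit ordinary escaped string literals, which some compilers handle better for very long strings.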