Principle:Duckdb Duckdb Auxiliary Code Generation

Overview

This principle covers the generation of auxiliary code: profiling metric enums, embedded test data, and compile-time constants. DuckDB uses a set of small code generation scripts to produce supporting infrastructure that falls outside the larger function-registration and settings-generation categories.

Description

The Auxiliary Code Generation principle governs several small code generation tasks, each of which produces a supporting artifact for the DuckDB engine:

Profiling metric enums -- A generator reads a JSON specification of profiling metric types and produces C++ enum definitions along with helper functions for converting between metric names and enum values. This ensures the profiling infrastructure has a consistent, auto-generated set of metric identifiers.
Embedded test data -- Generators read SQL query files (TPC-H and TPC-DS workloads) and embed them as C++ string constants. This allows the test and benchmark infrastructure to reference standard queries without runtime file I/O.
TPC-DS schema metadata -- A generator produces C++ headers describing the TPC-DS schema (table names, column definitions) for use in the TPC-DS benchmark harness.
Vector size constants -- A generator produces a C++ header defining compile-time constants for the DuckDB vector size, which governs the batch size used throughout the vectorized execution engine.

Each of these tasks follows the same overarching pattern: a declarative or file-based input is transformed by a Python script into C++ code that is compiled into the engine.
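The pattern can be illustrated with a minimal sketch of the metric enum case. This is not DuckDB's actual generator: the JSON layout (a flat list of names), the enum name `ExampleMetricType`, and the helper name `generate_metric_enum` are all assumptions made for illustration.

```python
import json

# Hypothetical specification: a JSON list of metric names.
# The schema of the real metric type JSON is not shown here and may differ.
EXAMPLE_SPEC = '["CPU_TIME", "LATENCY", "ROWS_RETURNED"]'


def generate_metric_enum(spec_json: str, enum_name: str = "ExampleMetricType") -> str:
    """Render a C++ enum plus an enum-to-name helper from a JSON list of names."""
    metrics = json.loads(spec_json)
    if len(set(metrics)) != len(metrics):
        raise ValueError("duplicate metric name in specification")

    lines = ["#pragma once", ""]
    # One enumerator per metric, in specification order.
    lines.append(f"enum class {enum_name} {{")
    lines += [f"\t{name}," for name in metrics]
    lines += ["};", ""]
    # Helper that maps each enumerator back to its name.
    lines.append(f"inline const char *{enum_name}ToString({enum_name} type) {{")
    lines.append("\tswitch (type) {")
    for name in metrics:
        lines.append(f"\tcase {enum_name}::{name}:")
        lines.append(f'\t\treturn "{name}";')
    lines += ["\t}", '\treturn "UNKNOWN";', "}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_metric_enum(EXAMPLE_SPEC))
```

Because both the enum and the conversion helper are rendered from the same list, adding a metric to the specification updates both in one step.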

Usage

Apply this principle when modifying profiling metrics, updating test queries, or changing vector size constants:

When a new profiling metric is introduced, update the metric type JSON and re-run the metric enum generator.
When TPC-H or TPC-DS queries are updated, re-run the CSV header and TPC-DS generators to refresh the embedded C++ strings.
When the default vector size needs to change (e.g., for performance experiments), update the input and re-run the vector size generator.
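As a rough sketch of the last case, a vector size generator could be as small as the following. The macro name `EXAMPLE_VECTOR_SIZE` and the function name are hypothetical, and the real script's input format and output header are not shown in this page, so treat this as an illustration of the idea only.

```python
def generate_vector_size_header(vector_size: int, macro: str = "EXAMPLE_VECTOR_SIZE") -> str:
    """Render a C++ header that defines the vector size as a compile-time constant.

    The macro name is a placeholder chosen for this sketch.
    """
    if not isinstance(vector_size, int) or vector_size <= 0:
        raise ValueError("vector size must be a positive integer")
    return (
        "#pragma once\n"
        "\n"
        f"#define {macro} {vector_size}\n"
    )


if __name__ == "__main__":
    print(generate_vector_size_header(2048))
```

Validating the value in the generator means a bad input fails at generation time rather than surfacing later as a confusing compile or runtime error.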

Theoretical Basis

Compile-time code embedding -- SQL queries and other text artifacts are embedded as C++ string literals at compile time, eliminating runtime file dependencies and ensuring reproducible benchmarks.
Constant generation -- Values such as vector sizes and enum discriminants are generated from a single source of truth, preventing accidental inconsistencies when the same constant is referenced in multiple places.
Test data embedding -- Standard benchmark queries (TPC-H, TPC-DS) are baked into the binary, so benchmark and test harnesses can run them without reading external query files.
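The embedding idea above can be sketched as follows. This is one possible approach, assuming escaped string literals (one per source line, relying on C++ adjacent-literal concatenation); the array name `EXAMPLE_QUERIES` and both function names are hypothetical and do not reflect DuckDB's generated headers.

```python
def to_cpp_string_literal(text: str) -> str:
    """Escape text into adjacent C++ string literals, one per source line."""
    parts = []
    for line in text.splitlines(keepends=True):
        escaped = (
            line.replace("\\", "\\\\")  # escape backslashes first
            .replace('"', '\\"')        # then double quotes
            .replace("\r", "")          # drop carriage returns
            .replace("\n", "\\n")       # keep newlines as escape sequences
        )
        parts.append(f'"{escaped}"')
    # Adjacent literals are concatenated by the C++ compiler.
    return "\n\t".join(parts) if parts else '""'


def embed_queries(queries: list[str], array_name: str = "EXAMPLE_QUERIES") -> str:
    """Render a C++ header holding each query as an element of a string array."""
    lines = ["#pragma once", "", f"static const char *{array_name}[] = {{"]
    for query in queries:
        lines.append(f"\t{to_cpp_string_literal(query)},")
    lines.append("};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(embed_queries(['SELECT "a"\nFROM t;\n', "SELECT 42;\n"]))
```

Escaping rather than using raw string literals keeps the output valid even if a query happens to contain a raw-string delimiter, at the cost of slightly less readable generated code.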
