Implementation:Duckdb Duckdb Generate Grammar
Overview
Concrete tool for assembling and generating DuckDB's SQL parser from modular grammar fragments using bison and flex. Two Python scripts orchestrate the process: generate_grammar.py assembles grammar fragments into a complete bison input file and invokes bison, while generate_flex.py invokes flex and post-processes the scanner output.
Code Reference
| Field | Value |
|---|---|
| Source (grammar) | scripts/generate_grammar.py (lines 1--310) |
| Source (flex) | scripts/generate_flex.py (lines 1--93) |
| Language | Python 3 |
| API (grammar) | python3 scripts/generate_grammar.py [--bison path] [--counterexamples] [--update] [--namespace ns] [--verbose] |
| API (flex) | python3 scripts/generate_flex.py [--flex path] [--custom_dir_prefix prefix] [--namespace ns] |
| External Dependencies | python3, bison (>= 3.0 recommended, >= 3.8 for counterexamples), flex |
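Because the counterexample diagnostics depend on the bison version, it can help to check the installed version before passing --counterexamples. The following is a minimal sketch, not part of the scripts themselves: the helper names parse_bison_version and supports_counterexamples are hypothetical, and the 3.8 threshold is simply the one stated in the table above.

```python
import re


def parse_bison_version(version_output):
    """Return (major, minor) parsed from the first line of `bison --version`."""
    first_line = version_output.splitlines()[0] if version_output else ""
    match = re.search(r"(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError("could not find a version number in: %r" % first_line)
    return int(match.group(1)), int(match.group(2))


def supports_counterexamples(version):
    # Threshold taken from the dependency table (bison >= 3.8).
    return version >= (3, 8)


# Sample text in the shape GNU Bison prints for `bison --version`.
sample = "bison (GNU Bison) 3.8.2\nWritten by Robert Corbett and Richard Stallman.\n"
version = parse_bison_version(sample)
print(version, supports_counterexamples(version))
```

In practice the version text could be obtained with subprocess.run(["bison", "--version"], capture_output=True, text=True).stdout.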
Configuration Constants
| Constant | Default Value | Purpose |
|---|---|---|
| bison_location | "bison" | Path to the bison executable |
| base_dir | 'third_party/libpg_query/grammar' | Root directory for grammar fragment files |
| pg_dir | 'third_party/libpg_query' | Root directory for libpg_query |
| namespace | 'duckdb_libpgquery' | C++ namespace for generated parser code |
Command-Line Options (generate_grammar.py)
| Option | Description |
|---|---|
| --bison=PATH | Path to bison binary |
| --counterexamples | Enable bison's -Wcounterexamples for debugging shift/reduce conflicts (requires bison >= 3.8) |
| --update | Pass --update to bison |
| --custom_dir_prefix=PREFIX | Prefix for source and target directories |
| --namespace=NS | C++ namespace name for generated code |
| --verbose | Enable verbose bison output |
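To illustrate how these options could translate into a bison invocation, here is a hedged sketch. The function build_bison_command and the default file names are assumptions made for illustration; the real script may parse its arguments and name its files differently. The bison flags used (-Wcounterexamples, --update, --verbose, -o, -d) are standard GNU Bison options.

```python
def build_bison_command(args, grammar_file="grammar.y.tmp", output_file="gram.cpp"):
    """Map generate_grammar.py-style options onto a bison argument list.

    grammar_file and output_file are placeholder names for this sketch.
    """
    bison = "bison"
    flags = []
    for arg in args:
        if arg.startswith("--bison="):
            bison = arg.split("=", 1)[1]
        elif arg == "--counterexamples":
            flags.append("-Wcounterexamples")
        elif arg == "--update":
            flags.append("--update")
        elif arg == "--verbose":
            flags.append("--verbose")
        elif arg.startswith("--custom_dir_prefix=") or arg.startswith("--namespace="):
            # These affect paths and generated code, not the bison command line.
            continue
        else:
            raise ValueError("unrecognized option: %s" % arg)
    # -o selects the generated source file, -d also emits the header.
    return [bison] + flags + ["-o", output_file, "-d", grammar_file]


print(build_bison_command(["--bison=/usr/local/bin/bison", "--counterexamples", "--verbose"]))
```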
I/O Contract
Inputs: Grammar Fragments
| Input Path | Description |
|---|---|
| third_party/libpg_query/grammar/grammar.y | Template file with placeholder markers for assembly |
| third_party/libpg_query/grammar/grammar.hpp | Grammar header with type declarations |
| third_party/libpg_query/grammar/grammar.cpp | Grammar C source fragments |
| third_party/libpg_query/grammar/statements/*.y | Per-statement grammar rule files (~42 files: select.y, insert.y, create.y, alter_table.y, etc.) |
| third_party/libpg_query/grammar/types/*.yh | Bison type declaration files |
| third_party/libpg_query/grammar/statements.list | List of top-level statement rule names |
| third_party/libpg_query/grammar/keywords/unreserved_keywords.list | Unreserved SQL keywords |
| third_party/libpg_query/grammar/keywords/reserved_keywords.list | Reserved SQL keywords |
| third_party/libpg_query/grammar/keywords/column_name_keywords.list | Column-name-position keywords |
| third_party/libpg_query/grammar/keywords/func_name_keywords.list | Function-name-position keywords |
| third_party/libpg_query/grammar/keywords/type_name_keywords.list | Type-name-position keywords |
Inputs: Scanner Rules
| Input Path | Description |
|---|---|
| third_party/libpg_query/scan.l | Flex scanner rules defining SQL tokenization |
Outputs
| Output File | Description | Approximate Size |
|---|---|---|
| third_party/libpg_query/src_backend_parser_gram.cpp | Generated LALR parser C++ source | ~33,872 lines |
| third_party/libpg_query/include/parser/gram.hpp | Generated parser header with token definitions | -- |
| third_party/libpg_query/include/parser/kwlist.hpp | Generated keyword list with categories | -- |
| third_party/libpg_query/src_backend_parser_scan.cpp | Generated flex scanner C++ source | ~4,394 lines |
Assembly Process
The grammar assembly in generate_grammar.py follows these steps:
- Read keyword lists from five category files, sort them, and validate for duplicates and conflicting classifications
- Generate keyword structures:

      // Generated kwlist.hpp
      namespace duckdb_libpgquery {
      const PGScanKeyword ScanKeywords[] = {
          PG_KEYWORD("abort", ABORT_P, UNRESERVED_KEYWORD)
          PG_KEYWORD("absolute", ABSOLUTE_P, UNRESERVED_KEYWORD)
          // ... hundreds more
      };
      const int NumScanKeywords = lengthof(ScanKeywords);
      }

- Read the template grammar.y and perform placeholder substitutions:
  - {{{ GRAMMAR_HEADER }}} -- contents of grammar.hpp
  - {{{ GRAMMAR_SOURCE }}} -- contents of grammar.cpp
  - {{{ KEYWORDS }}} -- %token declarations for all keywords
  - {{{ STATEMENTS }}} -- top-level stmt: rule from statements.list
  - {{{ KEYWORD_DEFINITIONS }}} -- keyword category rules
  - {{{ TYPES }}} -- bison type declarations from types/*.yh
  - {{{ GRAMMAR RULES }}} -- all grammar rules from statements/*.y
- Write assembled grammar to grammar.y.tmp
- Invoke bison to generate the parser source and header
- Post-process the generated source: fix include paths, suppress compiler warnings
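The keyword validation and placeholder substitution steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the function names are hypothetical, the keyword lists are assumed to hold token names with an optional _P suffix (as in ABORT_P), and the template here is a tiny stand-in for grammar.y.

```python
def validate_keywords(categories):
    """categories maps a category name to its list of token names."""
    seen = {}
    for category, tokens in categories.items():
        for token in tokens:
            if token in seen:
                previous = seen[token]
                kind = "duplicate" if previous == category else "conflicting"
                raise ValueError("%s keyword %s (%s vs %s)" % (kind, token, previous, category))
            seen[token] = category
    # Sorted (token, category) pairs.
    return sorted(seen.items())


def keyword_name(token):
    # Assumption: a trailing _P is dropped, so ABORT_P becomes "abort".
    return (token[:-2] if token.endswith("_P") else token).lower()


def kwlist_entries(sorted_keywords):
    return ['PG_KEYWORD("%s", %s, %s)' % (keyword_name(tok), tok, cat)
            for tok, cat in sorted_keywords]


def fill_template(template, replacements):
    """Substitute each {{{ MARKER }}} placeholder with its content."""
    for marker, content in replacements.items():
        placeholder = "{{{ %s }}}" % marker
        if placeholder not in template:
            raise KeyError("placeholder not found: %s" % placeholder)
        template = template.replace(placeholder, content)
    return template


categories = {
    "UNRESERVED_KEYWORD": ["ABSOLUTE_P", "ABORT_P"],
    "RESERVED_KEYWORD": ["SELECT"],
}
keywords = validate_keywords(categories)
template = "{{{ KEYWORDS }}}\n%%\n{{{ GRAMMAR RULES }}}\n"
assembled = fill_template(template, {
    "KEYWORDS": "%token " + " ".join(tok for tok, _ in keywords),
    "GRAMMAR RULES": "stmt: SELECT ;",
})
print("\n".join(kwlist_entries(keywords)))
print(assembled)
```

Failing loudly on a missing placeholder is a design choice of this sketch: it surfaces a template and script mismatch before bison ever runs.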
Flex Post-Processing
The generate_flex.py script invokes flex and then applies several transformations to the generated scanner:
- Namespace wrapping -- wraps all generated code in the duckdb_libpgquery namespace
- Type fix -- changes int yy_buf_size to yy_size_t yy_buf_size to suppress warnings
- Remove stdio references -- strips stdin and stdout references for embeddability
- Replace exit calls -- converts exit() to throw std::runtime_error()
- Remove fprintf calls -- comments out fprintf calls
- Remove register keyword -- strips the deprecated register storage class
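The transformations listed above are plain text rewrites, which can be approximated with a few regular expressions. The sketch below is an assumption-laden illustration rather than the code in generate_flex.py: the function name, the exact patterns, the replacement for stdin/stdout, and the exception message are all placeholders, and only single-line fprintf calls are handled.

```python
import re


def postprocess_scanner(text, namespace="duckdb_libpgquery"):
    # Type fix: widen the buffer size field.
    text = re.sub(r"\bint yy_buf_size\b", "yy_size_t yy_buf_size", text)
    # Comment out single-line fprintf calls.
    text = re.sub(r"^([ \t]*)(.*\bfprintf\b.*)$", r"\1// \2", text, flags=re.MULTILINE)
    # Replace exit(...) with an exception; the message is a placeholder.
    text = re.sub(r"\bexit\s*\([^;]*\);", 'throw std::runtime_error("scanner error");', text)
    # Drop the deprecated register storage class.
    text = re.sub(r"\bregister\s+", "", text)
    # Remove stdin/stdout references by substituting a null FILE pointer.
    text = re.sub(r"\b(stdin|stdout)\b", "(FILE *) 0", text)
    # Namespace wrapping: keep #include lines outside the namespace.
    lines = text.splitlines()
    includes = [line for line in lines if line.startswith("#include")]
    body = [line for line in lines if not line.startswith("#include")]
    wrapped = includes + ["namespace %s {" % namespace] + body + ["} // namespace %s" % namespace]
    return "\n".join(wrapped) + "\n"


# A tiny stand-in for flex output, just enough to exercise each rewrite.
sample = r"""#include <stdio.h>
struct yy_buffer_state { int yy_buf_size; };
static void yy_fatal_error(const char *msg) {
    fprintf(stderr, "%s\n", msg);
    exit(YY_EXIT_FAILURE);
}
void init(void) { register int i = 0; yyin = stdin; yyout = stdout; }
"""
print(postprocess_scanner(sample))
```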
Usage Examples
Generate the parser from the repository root (the default paths such as third_party/libpg_query/grammar are relative to it):

    # Step 1: Assemble grammar and run bison
    python3 scripts/generate_grammar.py
    # Step 2: Run flex to generate the scanner
    python3 scripts/generate_flex.py

With a custom bison path and counterexample diagnostics:

    python3 scripts/generate_grammar.py --bison=/usr/local/bin/bison --counterexamples --verbose
Typical Workflow
- Create or edit a grammar fragment in third_party/libpg_query/grammar/statements/
- If adding a new statement, add its name to statements.list
- If adding new keywords, add them to the appropriate keyword list file
- Run python3 scripts/generate_grammar.py from the repository root
- If the scanner rules changed, also run python3 scripts/generate_flex.py
- If shift/reduce conflicts arise, re-run with --counterexamples for diagnostic output
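The decision of which generator to rerun can be expressed as a small helper. This is a hypothetical convenience sketch, not something shipped with the scripts: the function name generators_to_run is an assumption, and the path checks simply mirror the input locations listed earlier on this page.

```python
GRAMMAR_DIR = "third_party/libpg_query/grammar/"
SCANNER_FILE = "third_party/libpg_query/scan.l"


def generators_to_run(changed_paths):
    """Return the generator commands implied by a list of changed files."""
    commands = []
    # Any change under the grammar directory requires regenerating the parser.
    if any(path.startswith(GRAMMAR_DIR) for path in changed_paths):
        commands.append(["python3", "scripts/generate_grammar.py"])
    # A change to the scanner rules requires rerunning flex.
    if SCANNER_FILE in changed_paths:
        commands.append(["python3", "scripts/generate_flex.py"])
    return commands


changed = [
    "third_party/libpg_query/grammar/statements/select.y",
    "third_party/libpg_query/scan.l",
]
for command in generators_to_run(changed):
    print(" ".join(command))
```

Each returned command could then be executed from the repository root, for example with subprocess.run(command, check=True).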