Implementation:Duckdb Duckdb Generate Grammar

Overview

Concrete tool for assembling DuckDB's SQL parser from modular grammar fragments and generating it with bison and flex. Two Python scripts orchestrate the process: generate_grammar.py assembles the grammar fragments into a complete bison input file and invokes bison, while generate_flex.py invokes flex and post-processes the scanner output.

Code Reference

Field	Value
Source (grammar)	`scripts/generate_grammar.py` (lines 1--310)
Source (flex)	`scripts/generate_flex.py` (lines 1--93)
Language	Python 3
API (grammar)	`python3 scripts/generate_grammar.py [--bison=PATH] [--counterexamples] [--update] [--custom_dir_prefix=PREFIX] [--namespace=NS] [--verbose]`
API (flex)	`python3 scripts/generate_flex.py [--flex=PATH] [--custom_dir_prefix=PREFIX] [--namespace=NS]`
External Dependencies	`python3`, `bison` (>= 3.0 recommended, >= 3.8 for counterexamples), `flex`

Configuration Constants

Constant	Default Value	Purpose
`bison_location`	`"bison"`	Path to the bison executable
`base_dir`	`'third_party/libpg_query/grammar'`	Root directory for grammar fragment files
`pg_dir`	`'third_party/libpg_query'`	Root directory for libpg_query
`namespace`	`'duckdb_libpgquery'`	C++ namespace for generated parser code

Command-Line Options (generate_grammar.py)

Option	Description
`--bison=PATH`	Path to bison binary
`--counterexamples`	Enable bison's `-Wcounterexamples` for debugging shift/reduce conflicts (requires bison >= 3.8)
`--update`	Pass `--update` to bison
`--custom_dir_prefix=PREFIX`	Prefix for source and target directories
`--namespace=NS`	C++ namespace name for generated code
`--verbose`	Enable verbose bison output
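
The options above can be pictured with a small parser. This is a minimal sketch, not the script's actual code: the function name `parse_args`, the dictionary layout, and the strict rejection of unknown arguments are assumptions made for illustration. Only the option names and defaults come from the tables above.

```python
def parse_args(argv):
    """Parse generate_grammar.py-style options (hypothetical helper)."""
    # Defaults mirror the Configuration Constants table.
    opts = {
        "bison": "bison",
        "counterexamples": False,
        "update": False,
        "custom_dir_prefix": "",
        "namespace": "duckdb_libpgquery",
        "verbose": False,
    }
    for arg in argv:
        if arg.startswith("--bison="):
            opts["bison"] = arg.split("=", 1)[1]
        elif arg.startswith("--custom_dir_prefix="):
            opts["custom_dir_prefix"] = arg.split("=", 1)[1]
        elif arg.startswith("--namespace="):
            opts["namespace"] = arg.split("=", 1)[1]
        elif arg == "--counterexamples":
            opts["counterexamples"] = True
        elif arg == "--update":
            opts["update"] = True
        elif arg == "--verbose":
            opts["verbose"] = True
        else:
            # Assumption: unknown options are treated as errors.
            raise ValueError(f"Unrecognized argument: {arg}")
    return opts
```

Called as `parse_args(sys.argv[1:])`, the sketch returns the defaults when no options are given, so a plain invocation uses `bison` from the PATH and the `duckdb_libpgquery` namespace.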

I/O Contract

Inputs: Grammar Fragments

Input Path	Description
`third_party/libpg_query/grammar/grammar.y`	Template file with placeholder markers for assembly
`third_party/libpg_query/grammar/grammar.hpp`	Grammar header with type declarations
`third_party/libpg_query/grammar/grammar.cpp`	Grammar C++ source fragments
`third_party/libpg_query/grammar/statements/*.y`	Per-statement grammar rule files (~42 files: `select.y`, `insert.y`, `create.y`, `alter_table.y`, etc.)
`third_party/libpg_query/grammar/types/*.yh`	Bison type declaration files
`third_party/libpg_query/grammar/statements.list`	List of top-level statement rule names
`third_party/libpg_query/grammar/keywords/unreserved_keywords.list`	Unreserved SQL keywords
`third_party/libpg_query/grammar/keywords/reserved_keywords.list`	Reserved SQL keywords
`third_party/libpg_query/grammar/keywords/column_name_keywords.list`	Column-name-position keywords
`third_party/libpg_query/grammar/keywords/func_name_keywords.list`	Function-name-position keywords
`third_party/libpg_query/grammar/keywords/type_name_keywords.list`	Type-name-position keywords

Inputs: Scanner Rules

Input Path	Description
`third_party/libpg_query/scan.l`	Flex scanner rules defining SQL tokenization

Outputs

Output File	Description	Approximate Size
`third_party/libpg_query/src_backend_parser_gram.cpp`	Generated LALR parser C++ source	~33,872 lines
`third_party/libpg_query/include/parser/gram.hpp`	Generated parser header with token definitions	--
`third_party/libpg_query/include/parser/kwlist.hpp`	Generated keyword list with categories	--
`third_party/libpg_query/src_backend_parser_scan.cpp`	Generated flex scanner C++ source	~4,394 lines

Assembly Process

The grammar assembly in generate_grammar.py follows these steps:

Read keyword lists from five category files, sort them, and validate for duplicates and conflicting classifications

Generate keyword structures:

// Generated kwlist.hpp
namespace duckdb_libpgquery {
const PGScanKeyword ScanKeywords[] = {
    PG_KEYWORD("abort", ABORT_P, UNRESERVED_KEYWORD)
    PG_KEYWORD("absolute", ABSOLUTE_P, UNRESERVED_KEYWORD)
    // ... hundreds more
};
const int NumScanKeywords = lengthof(ScanKeywords);
}

Read the template grammar.y and perform placeholder substitutions:
- {{{ GRAMMAR_HEADER }}} -- contents of grammar.hpp
- {{{ GRAMMAR_SOURCE }}} -- contents of grammar.cpp
- {{{ KEYWORDS }}} -- %token declarations for all keywords
- {{{ STATEMENTS }}} -- top-level stmt: rule from statements.list
- {{{ KEYWORD_DEFINITIONS }}} -- keyword category rules
- {{{ TYPES }}} -- bison type declarations from types/*.yh
- {{{ GRAMMAR RULES }}} -- all grammar rules from statements/*.y

Write the assembled grammar to grammar.y.tmp

Invoke bison to generate the parser source and header

Post-process the generated source: fix include paths, suppress compiler warnings
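
The keyword validation and placeholder substitution steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in generate_grammar.py: the helper names `load_keywords` and `fill_template`, the in-memory inputs, and the exact form of the `%token` line are all hypothetical. Only the `{{{ NAME }}}` marker syntax and the category idea come from the steps above.

```python
def load_keywords(lists):
    """Merge keyword lists (hypothetical helper).

    `lists` maps a category name to an iterable of raw lines.
    Returns sorted (keyword, category) pairs and raises ValueError
    on duplicates or on a keyword assigned to two categories.
    """
    seen = {}
    for category, lines in lists.items():
        for line in lines:
            kw = line.strip()
            if not kw:
                continue  # skip blank lines
            if kw in seen:
                if seen[kw] == category:
                    raise ValueError(f"duplicate keyword {kw!r} in {category}")
                raise ValueError(
                    f"keyword {kw!r} is classified as both {seen[kw]} and {category}"
                )
            seen[kw] = category
    return sorted(seen.items())


def fill_template(template, replacements):
    """Replace each {{{ NAME }}} marker with its body (hypothetical helper)."""
    for name, body in replacements.items():
        marker = "{{{ " + name + " }}}"
        if marker not in template:
            raise KeyError(f"placeholder {marker} not found in template")
        template = template.replace(marker, body)
    return template


# Tiny in-memory example; the real inputs are the files listed under I/O Contract.
keywords = load_keywords({
    "UNRESERVED_KEYWORD": ["ABSOLUTE_P", "ABORT_P"],
    "RESERVED_KEYWORD": ["SELECT"],
})
grammar = fill_template(
    "{{{ KEYWORDS }}}\n%%\n{{{ STATEMENTS }}}\n",
    {
        # Assumed shape of the token declaration line.
        "KEYWORDS": "%token <keyword> " + " ".join(kw for kw, _ in keywords),
        "STATEMENTS": "stmt: SelectStmt",
    },
)
```

Failing loudly on a missing marker is a design choice of this sketch: it surfaces a typo in the template immediately instead of handing bison a grammar that still contains an unexpanded placeholder.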

Flex Post-Processing

The generate_flex.py script invokes flex and then applies several transformations to the generated scanner:

Namespace wrapping -- wraps all generated code in the duckdb_libpgquery namespace
Type fix -- changes int yy_buf_size to yy_size_t yy_buf_size to suppress warnings
Remove stdio references -- strips stdin and stdout references for embeddability
Replace exit calls -- converts exit() to throw std::runtime_error()
Remove fprintf calls -- comments out fprintf calls
Remove register keyword -- strips the deprecated register storage class
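
The transformations listed above can be sketched as plain text rewrites over the scanner source. This is a rough sketch under assumptions, not the actual generate_flex.py logic: the function name `postprocess_scanner`, every regular expression, the replacement strings, and the choice to open the namespace after the last `#include` are all made up for illustration.

```python
import re


def postprocess_scanner(text, namespace="duckdb_libpgquery"):
    """Apply illustrative post-processing steps to flex output (hypothetical)."""
    # Type fix: int yy_buf_size -> yy_size_t yy_buf_size
    text = re.sub(r"\bint(\s+yy_buf_size\b)", r"yy_size_t\1", text)
    # Remove the deprecated register storage class.
    text = re.sub(r"\bregister\s+", "", text)
    # Remove stdio references (assumed form: assignments to yyin/yyout).
    text = re.sub(r"\b(yyin|yyout)\s*=\s*(?:stdin|stdout)\s*;", r"\1 = (FILE *) 0;", text)
    # Comment out any line that calls fprintf.
    text = re.sub(r"^([ \t]*)(.*\bfprintf\s*\(.*)$", r"\1// \2", text, flags=re.M)
    # Replace exit(...) calls with a C++ exception.
    text = re.sub(
        r"\bexit\s*\([^)]*\)\s*;",
        'throw std::runtime_error("fatal scanner error");',
        text,
    )
    # Namespace wrapping: open after the last #include, close at the end.
    lines = text.splitlines()
    last_include = max(
        (i for i, line in enumerate(lines) if line.lstrip().startswith("#include")),
        default=-1,
    )
    lines.insert(last_include + 1, f"namespace {namespace} {{")
    lines.append(f"}} // namespace {namespace}")
    return "\n".join(lines) + "\n"
```

Opening the namespace only after the includes keeps system headers in the global namespace, which is why the sketch searches for the last `#include` rather than wrapping the whole file.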

Usage Examples

Generate the parser from the repository root:

# Step 1: Assemble grammar and run bison
python3 scripts/generate_grammar.py

# Step 2: Run flex to generate the scanner
python3 scripts/generate_flex.py

With a custom bison path and counterexample diagnostics:

python3 scripts/generate_grammar.py --bison=/usr/local/bin/bison --counterexamples --verbose

Typical Workflow

Create or edit a grammar fragment in third_party/libpg_query/grammar/statements/
If adding a new statement, add its name to statements.list
If adding new keywords, add them to the appropriate keyword list file
Run python3 scripts/generate_grammar.py from the repository root
If the scanner rules changed, also run python3 scripts/generate_flex.py
If shift/reduce conflicts arise, re-run with --counterexamples for diagnostic output
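
The regeneration part of this workflow could be wrapped in a small driver. This is a hypothetical convenience sketch, not something shipped with DuckDB: `build_commands` and `regenerate` are invented names, and the only facts taken from this page are the script paths and the `--bison=` and `--counterexamples` options.

```python
import subprocess
import sys


def build_commands(scanner_changed=False, counterexamples=False, bison=None):
    """Return the command lines for a regeneration run (hypothetical helper)."""
    grammar_cmd = [sys.executable, "scripts/generate_grammar.py"]
    if bison:
        grammar_cmd.append(f"--bison={bison}")
    if counterexamples:
        grammar_cmd.append("--counterexamples")
    commands = [grammar_cmd]
    if scanner_changed:
        # The scanner only needs regenerating when scan.l was edited.
        commands.append([sys.executable, "scripts/generate_flex.py"])
    return commands


def regenerate(repo_root, **kwargs):
    """Run the commands from the repository root, stopping on the first failure."""
    for cmd in build_commands(**kwargs):
        subprocess.run(cmd, cwd=repo_root, check=True)
```

For example, `regenerate(".", scanner_changed=True)` would run both scripts in order, and `regenerate(".", counterexamples=True)` would re-run only the grammar step with conflict diagnostics enabled.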
