Implementation:Duckdb Duckdb Generate Grammar

From Leeroopedia


Overview

Concrete tool for assembling and generating DuckDB's SQL parser from modular grammar fragments using bison and flex. Two Python scripts orchestrate the process: generate_grammar.py assembles grammar fragments into a complete bison input file and invokes bison, while generate_flex.py invokes flex and post-processes the scanner output.

Code Reference

Field                  Value
Source (grammar)       scripts/generate_grammar.py (lines 1-310)
Source (flex)          scripts/generate_flex.py (lines 1-93)
Language               Python 3
API (grammar)          python3 scripts/generate_grammar.py [--bison=PATH] [--counterexamples] [--update] [--custom_dir_prefix=PREFIX] [--namespace=NS] [--verbose]
API (flex)             python3 scripts/generate_flex.py [--flex=PATH] [--custom_dir_prefix=PREFIX] [--namespace=NS]
External Dependencies  python3, bison (>= 3.0 recommended, >= 3.8 for counterexamples), flex

Configuration Constants

Constant        Default Value                      Purpose
bison_location  "bison"                            Path to the bison executable
base_dir        'third_party/libpg_query/grammar'  Root directory for grammar fragment files
pg_dir          'third_party/libpg_query'          Root directory for libpg_query
namespace       'duckdb_libpgquery'                C++ namespace for generated parser code

Command-Line Options (generate_grammar.py)

Option                      Description
--bison=PATH                Path to bison binary
--counterexamples           Enable bison's -Wcounterexamples for debugging shift/reduce conflicts (requires bison >= 3.8)
--update                    Pass --update to bison
--custom_dir_prefix=PREFIX  Prefix for source and target directories
--namespace=NS              C++ namespace name for generated code
--verbose                   Enable verbose bison output

I/O Contract

Inputs: Grammar Fragments

Input Path                                                          Description
third_party/libpg_query/grammar/grammar.y                           Template file with placeholder markers for assembly
third_party/libpg_query/grammar/grammar.hpp                         Grammar header with type declarations
third_party/libpg_query/grammar/grammar.cpp                         Grammar C source fragments
third_party/libpg_query/grammar/statements/*.y                      Per-statement grammar rule files (~42 files: select.y, insert.y, create.y, alter_table.y, etc.)
third_party/libpg_query/grammar/types/*.yh                          Bison type declaration files
third_party/libpg_query/grammar/statements.list                     List of top-level statement rule names
third_party/libpg_query/grammar/keywords/unreserved_keywords.list   Unreserved SQL keywords
third_party/libpg_query/grammar/keywords/reserved_keywords.list     Reserved SQL keywords
third_party/libpg_query/grammar/keywords/column_name_keywords.list  Column-name-position keywords
third_party/libpg_query/grammar/keywords/func_name_keywords.list    Function-name-position keywords
third_party/libpg_query/grammar/keywords/type_name_keywords.list    Type-name-position keywords

Inputs: Scanner Rules

Input Path                      Description
third_party/libpg_query/scan.l  Flex scanner rules defining SQL tokenization

Outputs

Output File                                          Description                                     Approximate Size
third_party/libpg_query/src_backend_parser_gram.cpp  Generated LALR parser C++ source                ~33,872 lines
third_party/libpg_query/include/parser/gram.hpp      Generated parser header with token definitions  --
third_party/libpg_query/include/parser/kwlist.hpp    Generated keyword list with categories          --
third_party/libpg_query/src_backend_parser_scan.cpp  Generated flex scanner C++ source               ~4,394 lines

Assembly Process

The grammar assembly in generate_grammar.py follows these steps:

  1. Read keyword lists from five category files, sort them, and validate for duplicates and conflicting classifications
  2. Generate keyword structures:
    // Generated kwlist.hpp
    namespace duckdb_libpgquery {
    const PGScanKeyword ScanKeywords[] = {
        PG_KEYWORD("abort", ABORT_P, UNRESERVED_KEYWORD)
        PG_KEYWORD("absolute", ABSOLUTE_P, UNRESERVED_KEYWORD)
        // ... hundreds more
    };
    const int NumScanKeywords = lengthof(ScanKeywords);
    }
    
  3. Read the template grammar.y and perform placeholder substitutions:
    • {{{ GRAMMAR_HEADER }}} -- contents of grammar.hpp
    • {{{ GRAMMAR_SOURCE }}} -- contents of grammar.cpp
    • {{{ KEYWORDS }}} -- %token declarations for all keywords
    • {{{ STATEMENTS }}} -- top-level stmt: rule from statements.list
    • {{{ KEYWORD_DEFINITIONS }}} -- keyword category rules
    • {{{ TYPES }}} -- bison type declarations from types/*.yh
    • {{{ GRAMMAR RULES }}} -- all grammar rules from statements/*.y
  4. Write assembled grammar to grammar.y.tmp
  5. Invoke bison to generate the parser source and header
  6. Post-process the generated source: fix include paths, suppress compiler warnings
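
The first three steps can be sketched in a few lines of Python. This is an illustrative approximation, not the code in generate_grammar.py: the function names classify_keywords, render_kwlist, and fill_template are hypothetical, and the sketch assumes every keyword belongs to exactly one list, which may be stricter than the real script.

```python
def classify_keywords(keyword_lists):
    """Map each keyword to its list name, rejecting duplicates and conflicts.

    keyword_lists maps a list name (e.g. "reserved_keywords") to its keywords.
    Assumption: a keyword may appear in only one list.
    """
    classified = {}
    for list_name, keywords in keyword_lists.items():
        seen = set()
        for raw in keywords:
            kw = raw.strip().lower()
            if not kw:
                continue  # skip blank lines in a list file
            if kw in seen:
                raise ValueError(f"duplicate keyword {kw!r} in {list_name}")
            seen.add(kw)
            if kw in classified:
                raise ValueError(
                    f"keyword {kw!r} is in both {classified[kw]} and {list_name}"
                )
            classified[kw] = list_name
    return dict(sorted(classified.items()))


def render_kwlist(entries, namespace="duckdb_libpgquery"):
    """Render (keyword, token, category) triples in the kwlist.hpp layout shown above."""
    lines = [f"namespace {namespace} {{", "const PGScanKeyword ScanKeywords[] = {"]
    lines += [
        f'    PG_KEYWORD("{kw}", {token}, {category})'
        for kw, token, category in sorted(entries)
    ]
    lines += ["};", "const int NumScanKeywords = lengthof(ScanKeywords);", "}"]
    return "\n".join(lines) + "\n"


def fill_template(template, replacements):
    """Replace each {{{ NAME }}} marker, failing loudly if a marker is missing."""
    for name, body in replacements.items():
        marker = "{{{ " + name + " }}}"
        if marker not in template:
            raise KeyError(f"placeholder {marker} not found in template")
        template = template.replace(marker, body)
    return template
```

Raising on a missing marker is a deliberate choice in this sketch: a silent no-op substitution would produce a grammar file that bison rejects with a far less helpful message.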

Flex Post-Processing

The generate_flex.py script invokes flex and then applies several transformations to the generated scanner:

  • Namespace wrapping -- wraps all generated code in the duckdb_libpgquery namespace
  • Type fix -- changes int yy_buf_size to yy_size_t yy_buf_size to suppress warnings
  • Remove stdio references -- strips stdin and stdout references for embeddability
  • Replace exit calls -- converts exit() to throw std::runtime_error()
  • Remove fprintf calls -- comments out fprintf calls
  • Remove register keyword -- strips the deprecated register storage class
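
The transformations above amount to a series of text substitutions on the flex output. The sketch below shows one way they could be written; postprocess_scanner is a hypothetical name, the regular expressions are simplified guesses rather than the patterns in generate_flex.py, and the null FILE pointer and exception message are placeholders.

```python
import re


def postprocess_scanner(text, namespace="duckdb_libpgquery"):
    """Apply simplified versions of the scanner fixes to flex output."""
    # Type fix: widen the buffer-size field to silence conversion warnings.
    text = text.replace("int yy_buf_size;", "yy_size_t yy_buf_size;")
    # Remove stdio references (assumption: a null FILE pointer is an acceptable stand-in).
    text = re.sub(r"\bstd(?:in|out)\b", "(FILE *) 0", text)
    # Replace exit(...) calls with a C++ exception (the message is a placeholder).
    text = re.sub(
        r"\bexit\s*\([^()]*\)\s*;",
        'throw std::runtime_error("scanner error");',
        text,
    )
    # Comment out single-line fprintf calls, keeping their indentation.
    text = re.sub(
        r"^([ \t]*)(fprintf\s*\(.*\);)[ \t]*$", r"\1// \2", text, flags=re.MULTILINE
    )
    # Drop the deprecated `register` storage class.
    text = re.sub(r"\bregister\s+", "", text)
    # Namespace wrapping: leave leading #include lines outside the namespace.
    lines = text.splitlines()
    split = 0
    while split < len(lines) and lines[split].startswith("#include"):
        split += 1
    wrapped = (
        lines[:split]
        + [f"namespace {namespace} {{"]
        + lines[split:]
        + [f"}} // namespace {namespace}"]
    )
    return "\n".join(wrapped) + "\n"
```

Calling postprocess_scanner on the contents of the generated scanner file and writing the result back would reproduce the general shape of the post-processing step; multi-line fprintf calls are not handled by this simplified pattern.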

Usage Examples

Generate the parser from the repository root:

# Step 1: Assemble grammar and run bison
python3 scripts/generate_grammar.py

# Step 2: Run flex to generate the scanner
python3 scripts/generate_flex.py

With a custom bison path and counterexample diagnostics:

python3 scripts/generate_grammar.py --bison=/usr/local/bin/bison --counterexamples --verbose

Typical Workflow

  1. Create or edit a grammar fragment in third_party/libpg_query/grammar/statements/
  2. If adding a new statement, add its name to statements.list
  3. If adding new keywords, add them to the appropriate keyword list file
  4. Run python3 scripts/generate_grammar.py from the repository root
  5. If the scanner rules changed, also run python3 scripts/generate_flex.py
  6. If shift/reduce conflicts arise, re-run with --counterexamples for diagnostic output
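
For repeated edits, the regeneration steps can be wrapped in a small driver. The following is a sketch only: grammar_command and regenerate are hypothetical helpers, not part of the DuckDB repository, and they assume the scripts are invoked from the repository root with the flags listed in the options table.

```python
import subprocess


def grammar_command(bison=None, counterexamples=False, verbose=False):
    """Build the argv list for generate_grammar.py from the documented flags."""
    cmd = ["python3", "scripts/generate_grammar.py"]
    if bison:
        cmd.append(f"--bison={bison}")
    if counterexamples:
        cmd.append("--counterexamples")
    if verbose:
        cmd.append("--verbose")
    return cmd


def regenerate(repo_root, scanner_changed=False, **grammar_opts):
    """Run the grammar step, then the flex step if the scanner rules changed."""
    commands = [grammar_command(**grammar_opts)]
    if scanner_changed:
        commands.append(["python3", "scripts/generate_flex.py"])
    for cmd in commands:
        # check=True stops at the first failing step, e.g. a bison conflict.
        subprocess.run(cmd, cwd=repo_root, check=True)
    return commands
```

A call such as regenerate(".", scanner_changed=True) covers steps 4 and 5; passing counterexamples=True corresponds to step 6.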
