Principle:Duckdb Duckdb SQL Grammar Generation
Overview
Generating SQL parsers and lexers from formal grammar specifications using parser generators. Rather than writing parser code by hand, the SQL grammar is expressed in modular bison (YACC) fragment files and flex scanner rules, which are assembled and then processed by bison and flex to produce C++ parser and scanner source files.
Description
Using bison (parser generator) and flex (lexer generator) to produce C++ parser code from grammar fragment files. The grammar is assembled from modular fragments before generation, allowing SQL syntax to be organized into manageable, per-statement files rather than a single monolithic grammar.
DuckDB's SQL parser is based on PostgreSQL's libpg_query parser, adapted with a modular grammar structure. The generation pipeline has several stages:
- Grammar assembly -- a Python script reads a template
grammar.yfile and performs placeholder substitutions:{{{ GRAMMAR_HEADER }}}is replaced with type declarations fromgrammar.hpp{{{ GRAMMAR_SOURCE }}}is replaced with C code fromgrammar.cpp{{{ KEYWORDS }}}is replaced with a generated%tokenlist from keyword list files{{{ STATEMENTS }}}is replaced with the top-level statement rule constructed fromstatements.list{{{ KEYWORD_DEFINITIONS }}}is replaced with keyword category rules (unreserved, reserved, column name, type name, function name){{{ TYPES }}}is replaced with bison type declarations concatenated fromtypes/*.yhfiles{{{ GRAMMAR RULES }}}is replaced with all grammar rules concatenated fromstatements/*.yfiles
- Keyword management -- keywords are organized into five categories (unreserved, column name, function name, type name, reserved) with validation to prevent duplicates and conflicting classifications
- Bison invocation -- the assembled
.yfile is passed to bison, producing a C++ parser source and header - Flex invocation -- a separate script invokes flex on the
scan.lfile, producing a C++ scanner, then post-processes it to add namespace wrapping, removestdin/stdoutreferences, and replaceexit()calls with exceptions - Post-processing -- the generated parser source has include paths fixed and compiler warning suppressions added
Usage
This principle applies when modifying SQL syntax or adding new SQL statements. Specific scenarios include:
- Adding a new SQL statement -- create a new
.ygrammar fragment file instatements/, add the statement name tostatements.list, and re-run the generator - Adding a new keyword -- add the keyword to the appropriate category file in
keywords/(unreserved, reserved, etc.) and re-run; the generator validates no duplicates or conflicting classifications - Modifying existing grammar rules -- edit the relevant
.yfragment and re-run; the generator reassembles the full grammar before invoking bison - Debugging shift/reduce conflicts -- re-run with
--counterexamplesflag (requires bison >= 3.8) to get diagnostic conflict information
Theoretical Basis
- Formal grammars (BNF/YACC) -- SQL syntax is specified in a formal context-free grammar using bison's YACC-compatible notation, enabling mechanical verification and unambiguous parsing
- LALR parsing -- bison generates an LALR(1) parser, providing efficient bottom-up parsing with conflict detection at generation time rather than runtime
- Lexical analysis -- flex generates a DFA-based scanner from regular expression rules, handling tokenization of SQL keywords, identifiers, literals, and operators
- Modular grammar composition -- splitting the grammar into per-statement fragment files enables independent development of SQL features while the assembly script produces the single file bison requires