Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Duckdb Duckdb SQL Grammar Generation

From Leeroopedia


Overview

Generating SQL parsers and lexers from formal grammar specifications using parser generators. Rather than writing parser code by hand, the SQL grammar is expressed in modular bison (YACC) fragment files and flex scanner rules, which are assembled and then processed by bison and flex to produce C++ parser and scanner source files.

Description

Using bison (parser generator) and flex (lexer generator) to produce C++ parser code from grammar fragment files. The grammar is assembled from modular fragments before generation, allowing SQL syntax to be organized into manageable, per-statement files rather than a single monolithic grammar.

DuckDB's SQL parser is based on PostgreSQL's libpg_query parser, adapted with a modular grammar structure. The generation pipeline has several stages:

  1. Grammar assembly -- a Python script reads a template grammar.y file and performs placeholder substitutions:
    • {{{ GRAMMAR_HEADER }}} is replaced with type declarations from grammar.hpp
    • {{{ GRAMMAR_SOURCE }}} is replaced with C code from grammar.cpp
    • {{{ KEYWORDS }}} is replaced with a generated %token list from keyword list files
    • {{{ STATEMENTS }}} is replaced with the top-level statement rule constructed from statements.list
    • {{{ KEYWORD_DEFINITIONS }}} is replaced with keyword category rules (unreserved, reserved, column name, type name, function name)
    • {{{ TYPES }}} is replaced with bison type declarations concatenated from types/*.yh files
    • {{{ GRAMMAR RULES }}} is replaced with all grammar rules concatenated from statements/*.y files
  2. Keyword management -- keywords are organized into five categories (unreserved, column name, function name, type name, reserved) with validation to prevent duplicates and conflicting classifications
  3. Bison invocation -- the assembled .y file is passed to bison, producing a C++ parser source and header
  4. Flex invocation -- a separate script invokes flex on the scan.l file, producing a C++ scanner, then post-processes it to add namespace wrapping, remove stdin/stdout references, and replace exit() calls with exceptions
  5. Post-processing -- the generated parser source has include paths fixed and compiler warning suppressions added

Usage

This principle applies when modifying SQL syntax or adding new SQL statements. Specific scenarios include:

  • Adding a new SQL statement -- create a new .y grammar fragment file in statements/, add the statement name to statements.list, and re-run the generator
  • Adding a new keyword -- add the keyword to the appropriate category file in keywords/ (unreserved, reserved, etc.) and re-run; the generator validates no duplicates or conflicting classifications
  • Modifying existing grammar rules -- edit the relevant .y fragment and re-run; the generator reassembles the full grammar before invoking bison
  • Debugging shift/reduce conflicts -- re-run with --counterexamples flag (requires bison >= 3.8) to get diagnostic conflict information

Theoretical Basis

  • Formal grammars (BNF/YACC) -- SQL syntax is specified in a formal context-free grammar using bison's YACC-compatible notation, enabling mechanical verification and unambiguous parsing
  • LALR parsing -- bison generates an LALR(1) parser, providing efficient bottom-up parsing with conflict detection at generation time rather than runtime
  • Lexical analysis -- flex generates a DFA-based scanner from regular expression rules, handling tokenization of SQL keywords, identifiers, literals, and operators
  • Modular grammar composition -- splitting the grammar into per-statement fragment files enables independent development of SQL features while the assembly script produces the single file bison requires

Related

Implementation:Duckdb_Duckdb_Generate_Grammar

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment