Principle:Duckdb Duckdb PostgreSQL SQL Parsing
| Knowledge Sources | |
|---|---|
| Domains | SQL_Parsing, Compiler_Frontend |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A SQL parsing system based on PostgreSQL's grammar that converts SQL text into an abstract syntax tree (AST) through lexical analysis and LALR(1) parser generation.
Description
SQL parsing is the process of converting a raw SQL text string into a structured abstract syntax tree (AST) that can be analyzed, transformed, and executed by a database engine. DuckDB uses PostgreSQL's battle-tested SQL grammar as the foundation for its parser, gaining the benefit of decades of refinement and comprehensive SQL coverage.
The parsing pipeline consists of two stages. The lexer (scanner) breaks the input text into a stream of tokens, identifying keywords (SELECT, FROM, WHERE), identifiers, literals (numbers, strings), operators, and punctuation. The lexer handles complexities like quoted identifiers, string escapes, dollar-quoted strings, and multi-character operators.
The parser consumes the token stream and applies grammar rules to build the AST. PostgreSQL's parser uses a LALR(1) grammar processed by Bison (a parser generator), which produces efficient table-driven parsers. The grammar defines the syntactic structure of SQL statements, including SELECT queries, DDL statements, expressions, subqueries, CTEs, window functions, and all other SQL constructs. Each grammar rule has an associated action that constructs AST nodes.
The resulting AST is a tree of typed nodes (e.g., SelectStmt, JoinExpr, ColumnRef) that represents the syntactic structure of the SQL statement. DuckDB then transforms this PostgreSQL-style AST into its own internal representation for further processing by the binder, optimizer, and executor.
Usage
SQL parsing is the entry point for every SQL statement executed by DuckDB. Whether a user submits a simple SELECT query, a complex CTE with window functions, or a DDL statement like CREATE TABLE, the parser converts the text into an AST that the rest of the system can process. The PostgreSQL grammar ensures broad SQL compatibility and reduces the maintenance burden of implementing a grammar from scratch.
Theoretical Basis
Lexical Analysis: Tokenizing SQL input:
// Lexer converts character stream to token stream
Input: "SELECT name FROM users WHERE id > 5"
Tokens:
SELECT -> keyword:SELECT
name -> identifier:"name"
FROM -> keyword:FROM
users -> identifier:"users"
WHERE -> keyword:WHERE
id -> identifier:"id"
> -> operator:GT
5 -> integer_literal:5
// Lexer rules (simplified):
[a-zA-Z_][a-zA-Z0-9_]* -> identifier or keyword (lookup table)
[0-9]+ -> integer literal
'[^']*' -> string literal
"[^"]*" -> quoted identifier
LALR(1) Parsing: Table-driven bottom-up parsing:
// LALR(1) = Look-Ahead LR with 1 token lookahead
// Parser uses a stack of states and symbols
// Grammar rules (simplified subset):
select_stmt: SELECT target_list FROM from_clause where_clause
target_list: target_el | target_list ',' target_el
target_el: expression | expression AS identifier
from_clause: table_ref | from_clause ',' table_ref
where_clause: /* empty */ | WHERE expression
// Parsing actions:
// SHIFT: push token onto stack, advance input
// REDUCE: pop symbols matching rule RHS, push rule LHS
// ACCEPT: parsing complete
// ERROR: syntax error
AST Construction: Grammar actions build tree nodes:
// Each grammar rule has an action that builds AST nodes
select_stmt:
SELECT target_list FROM from_clause where_clause
{
$$ = new SelectStmt()
$$.targetList = $2 // target_list
$$.fromClause = $4 // from_clause
$$.whereClause = $5 // where_clause
}
// Resulting AST for "SELECT name FROM users WHERE id > 5":
SelectStmt
targetList: [ColumnRef("name")]
fromClause: [RangeVar("users")]
whereClause: BoolExpr(GT,
ColumnRef("id"),
IntegerLiteral(5))