Principle:Duckdb Duckdb JSON Processing
| Knowledge Sources | |
|---|---|
| Domains | Data_Format, Serialization, Text_Processing |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
High-performance parsing and manipulation of JSON (JavaScript Object Notation) documents using an in-memory tree representation that supports both read-only and mutable access patterns.
Description
JSON (JavaScript Object Notation) is a lightweight, text-based data interchange format that has become the de facto standard for web APIs, configuration files, and semi-structured data storage. JSON processing encompasses parsing JSON text into an in-memory representation, querying and extracting values from that representation, and serializing the representation back to text.
High-performance JSON parsing is non-trivial because a parser must handle variable-length strings with escape sequences, arbitrary nesting depth, and numbers written in integer, fractional, or exponent form. Modern JSON parsers achieve high throughput through several techniques: SIMD-accelerated scanning for structural characters (braces, brackets, commas, colons), branchless number parsing, and memory pool allocation to reduce malloc overhead.
The in-memory representation typically uses either a DOM (Document Object Model) approach, where the entire document is parsed into a tree of typed nodes, or a SAX-style (Simple API for XML) streaming approach that emits an event for each element as it is encountered. The DOM approach provides random access to any part of the document, at the cost of memory proportional to the document size. The SAX approach keeps no tree, so its memory use depends on nesting depth rather than document size. High-performance libraries like yyjson use an optimized DOM model with a contiguous memory layout and an immutable (read-only) document representation for maximum cache efficiency, alongside a separate mutable document type for modification operations.
Usage
JSON processing is used extensively in DuckDB through its JSON extension and JSON type support. DuckDB can parse JSON documents, extract fields using JSONPath-like syntax (`json_extract`, `->`, `->>`), convert between JSON and relational types, and read newline-delimited JSON (NDJSON) files as tables. The high-performance parser lets DuckDB process JSON at high throughput, although parsing text remains more expensive than reading a native columnar format.
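The extraction and NDJSON-to-table ideas above can be sketched in plain Python. This is only an illustration of the concepts, not DuckDB's implementation: the `extract` helper and its tiny `$.key[index]` path dialect are hypothetical, and the sample records are made up for the example.

```python
import json

def extract(doc, path):
    """Follow a simple '$.key[index]' path; return None when a step is missing.

    Hypothetical helper for illustration only; it is not DuckDB's json_extract.
    """
    steps = []
    for part in path.lstrip("$").split("."):
        if not part:
            continue
        name, _, rest = part.partition("[")
        if name:
            steps.append(name)          # object member lookup
        while rest:
            index, _, rest = rest.partition("]")
            steps.append(int(index))    # array element lookup
            rest = rest.lstrip("[")
    node = doc
    for step in steps:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError):
            return None
    return node

# Newline-delimited JSON: one document per line, each becoming one row.
ndjson = (
    '{"id": 1, "user": {"name": "ada", "tags": ["x", "y"]}}\n'
    '{"id": 2, "user": {"name": "bob", "tags": []}}\n'
)
rows = [json.loads(line) for line in ndjson.splitlines() if line.strip()]

# Project three "columns" out of every document.
table = [
    (extract(r, "$.id"), extract(r, "$.user.name"), extract(r, "$.user.tags[1]"))
    for r in rows
]
print(table)
```

A missing path yields `None` here, which mirrors the relational convention of returning NULL for absent fields rather than raising an error.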
Theoretical Basis
JSON Grammar: The recursive structure of JSON:
// JSON value types (RFC 8259)
value = object | array | string | number | "true" | "false" | "null"
object = '{' [ member (',' member)* ] '}'
member = string ':' value
array = '[' [ value (',' value)* ] ']'
string = '"' characters '"'
number = [ '-' ] int [ frac ] [ exp ]
int = '0' | digit1-9 digit*
frac = '.' digit+
exp = ('e'|'E') ['+' | '-'] digit+
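Because the grammar is recursive, it maps naturally onto a recursive-descent parser with one function per rule. The following is a minimal sketch under simplifying assumptions: it does not combine `\u` surrogate pairs, does not reject raw control characters inside strings, and makes no attempt at speed. All function names are chosen for this example.

```python
import re

# number = [ '-' ] int [ frac ] [ exp ]
NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")
ESCAPES = {'"': '"', "\\": "\\", "/": "/", "b": "\b",
           "f": "\f", "n": "\n", "r": "\r", "t": "\t"}

def skip_ws(text, pos):
    while pos < len(text) and text[pos] in " \t\n\r":
        pos += 1
    return pos

def expect(text, pos, char):
    if text[pos:pos + 1] != char:
        raise ValueError(f"expected {char!r} at offset {pos}")
    return pos + 1

def parse(text):
    value, pos = parse_value(text, skip_ws(text, 0))
    pos = skip_ws(text, pos)
    if pos != len(text):
        raise ValueError(f"trailing data at offset {pos}")
    return value

def parse_value(text, pos):
    ch = text[pos:pos + 1]
    if ch == "{":
        return parse_object(text, pos)
    if ch == "[":
        return parse_array(text, pos)
    if ch == '"':
        return parse_string(text, pos)
    for literal, value in (("true", True), ("false", False), ("null", None)):
        if text.startswith(literal, pos):
            return value, pos + len(literal)
    match = NUMBER.match(text, pos)
    if not match:
        raise ValueError(f"unexpected input at offset {pos}")
    token = match.group()
    is_int = token.lstrip("-").isdigit()
    return (int(token) if is_int else float(token)), match.end()

def parse_string(text, pos):
    pos = expect(text, pos, '"')
    out = []
    while True:
        ch = text[pos:pos + 1]
        if ch == "":
            raise ValueError("unterminated string")
        if ch == '"':
            return "".join(out), pos + 1
        if ch == "\\":
            esc = text[pos + 1:pos + 2]
            if esc == "u":
                out.append(chr(int(text[pos + 2:pos + 6], 16)))
                pos += 6
            elif esc in ESCAPES:
                out.append(ESCAPES[esc])
                pos += 2
            else:
                raise ValueError(f"bad escape at offset {pos}")
        else:
            out.append(ch)
            pos += 1

def parse_array(text, pos):
    pos = skip_ws(text, expect(text, pos, "["))
    items = []
    if text[pos:pos + 1] == "]":
        return items, pos + 1
    while True:
        value, pos = parse_value(text, skip_ws(text, pos))
        items.append(value)
        pos = skip_ws(text, pos)
        if text[pos:pos + 1] == "]":
            return items, pos + 1
        pos = expect(text, pos, ",")

def parse_object(text, pos):
    pos = skip_ws(text, expect(text, pos, "{"))
    members = {}
    if text[pos:pos + 1] == "}":
        return members, pos + 1
    while True:
        key, pos = parse_string(text, skip_ws(text, pos))
        pos = expect(text, skip_ws(text, pos), ":")
        value, pos = parse_value(text, skip_ws(text, pos))
        members[key] = value
        pos = skip_ws(text, pos)
        if text[pos:pos + 1] == "}":
            return members, pos + 1
        pos = expect(text, pos, ",")
```

Each function returns the parsed value together with the offset just past it, so callers never re-scan input. Inputs that violate the grammar, such as a leading zero (`01`) or a trailing comma (`[1,]`), raise `ValueError`.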
DOM Tree Representation: In-memory document model:
// Node types in the document tree
Node = {
    type: object | array | string | number | bool | null
    value: union {
        object: list of (key, value) pairs
        array: list of values
        string: UTF-8 byte sequence
        number: int64 or double
        bool: true or false
    }
}
// Contiguous memory layout for cache efficiency:
// All nodes stored in a flat array, children referenced by index
// Strings stored in a separate contiguous buffer
// This avoids pointer-chasing and reduces allocator pressure
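The index-based layout described above can be sketched as follows. This is a conceptual model only and makes no claim about yyjson's actual node format: Python lists of tuples are not contiguous memory, strings are kept inline rather than in a separate buffer, and the names `flatten` and `get` are invented for the example. What it does show is a tree whose children are referenced by position in a single flat array instead of by pointer.

```python
import json

def flatten(value, nodes=None):
    """Append `value` and its descendants to a flat node list.

    Returns (index_of_value, nodes). Children are stored as indices into `nodes`.
    """
    if nodes is None:
        nodes = []
    index = len(nodes)
    nodes.append(None)  # reserve the slot so a parent precedes its children
    if isinstance(value, dict):
        children = [(key, flatten(child, nodes)[0]) for key, child in value.items()]
        nodes[index] = ("object", children)
    elif isinstance(value, list):
        nodes[index] = ("array", [flatten(child, nodes)[0] for child in value])
    elif isinstance(value, bool):  # must come before the number case: bool is an int in Python
        nodes[index] = ("bool", value)
    elif value is None:
        nodes[index] = ("null", None)
    elif isinstance(value, str):
        nodes[index] = ("string", value)
    else:
        nodes[index] = ("number", value)
    return index, nodes

def get(nodes, index, key):
    """Return the node index of member `key` of the object at `index`, or None."""
    kind, children = nodes[index]
    assert kind == "object"
    for name, child in children:
        if name == key:
            return child
    return None

doc = json.loads('{"id": 7, "tags": ["a", "b"]}')
root, nodes = flatten(doc)
for i, node in enumerate(nodes):
    print(i, node)
```

Because every reference is an integer offset into one array, the whole document can be released in a single step and traversal touches memory in roughly document order.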
SIMD-Accelerated Structural Scanning:
// Find structural characters using SIMD (conceptual)
function find_structural_chars(input, length):
    structural_masks = []   // one bitmask per chunk
    for each 16/32-byte chunk of input:
        // Compare against structural characters simultaneously
        eq_brace_open = simd_cmpeq(chunk, '{')
        eq_brace_close = simd_cmpeq(chunk, '}')
        eq_bracket_open = simd_cmpeq(chunk, '[')
        eq_bracket_close = simd_cmpeq(chunk, ']')
        eq_comma = simd_cmpeq(chunk, ',')
        eq_colon = simd_cmpeq(chunk, ':')
        eq_quote = simd_cmpeq(chunk, '"')
        eq_backslash = simd_cmpeq(chunk, '\\')
        // Combine masks
        structural = eq_brace_open | eq_brace_close | ...
        // Handle string interiors (between quotes) separately;
        // the in-string state must carry over from one chunk to the next
        in_string = compute_string_mask(eq_quote, eq_backslash)
        structural &= ~in_string
        append structural to structural_masks
    return structural_masks
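The masking logic above can be emulated in pure Python by using arbitrary-precision integers as bitmasks, with bit `i` standing for byte `i` of the input. This is a sketch of the idea rather than a SIMD implementation: the per-byte loops stand in for vector compares, the whole input is treated as one chunk, and the function names are assumptions made for this example. Real implementations compute the prefix XOR with much cheaper instructions (carry-less multiplication is one known approach).

```python
def char_mask(data, char):
    """Bit i is set when data[i] == char (stands in for a SIMD compare)."""
    mask = 0
    for i, byte in enumerate(data):
        if byte == char:
            mask |= 1 << i
    return mask

def escaped_mask(data):
    """Bit i is set when data[i] follows an odd-length run of backslashes."""
    mask, run = 0, 0
    for i, byte in enumerate(data):
        if run % 2 == 1:
            mask |= 1 << i
        run = run + 1 if byte == ord("\\") else 0
    return mask

def prefix_xor(mask, width):
    """Bit i of the result is the XOR of bits 0..i of `mask`."""
    out, inside = 0, 0
    for i in range(width):
        inside ^= (mask >> i) & 1
        out |= inside << i
    return out

def find_structural(data):
    """Return the offsets of structural characters that lie outside strings."""
    # Only unescaped quotes open or close a string.
    quotes = char_mask(data, ord('"')) & ~escaped_mask(data)
    # Set from each opening quote up to, but not including, its closing quote.
    in_string = prefix_xor(quotes, len(data))
    structural = 0
    for char in b"{}[],:":
        structural |= char_mask(data, char)
    structural &= ~in_string
    return [i for i in range(len(data)) if (structural >> i) & 1]

print(find_structural(b'{"a":"x,y","b":[1,2]}'))
```

In the sample input, the comma inside `"x,y"` is excluded because it falls between an opening and a closing quote, while the commas separating members and array elements are kept.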
Number Parsing: Fast path for integer and float conversion:
function parse_number(input):
    // Fast path: small integers
    if is_simple_integer(input):
        return fast_atoi(input) // branchless digit accumulation
    // General path: floating point
    // 1. Parse sign, integer part, fraction, exponent
    // 2. Use fast algorithm (Eisel-Lemire or similar)
    //    to convert decimal to IEEE 754 double
    // 3. Fall back to exact conversion if needed
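The two-tier structure above can be sketched in Python. The sketch assumes the token has already been validated against the JSON number grammar. It does not implement Eisel-Lemire: Python's built-in `float()` stands in for the whole general path, and the 18-digit cutoff is a conservative choice for this example, since any 18-digit decimal fits in a signed 64-bit integer. Longer integers simply fall through to the floating-point path here, which is a simplification.

```python
def parse_number(token):
    """Parse a JSON number token that already matches the grammar."""
    negative = token.startswith("-")
    digits = token[1:] if negative else token

    # Fast path: a plain run of at most 18 ASCII digits always fits in int64,
    # so it can be accumulated directly without any overflow checks.
    if digits.isascii() and digits.isdigit() and len(digits) <= 18:
        value = 0
        for ch in digits:
            value = value * 10 + (ord(ch) - ord("0"))
        return -value if negative else value

    # General path: fractions, exponents, and very long integers.
    # float() performs a correctly rounded decimal-to-double conversion.
    return float(token)

for token in ("42", "-7", "3.25", "1e3", "123456789012345678901"):
    print(token, "->", repr(parse_number(token)))
```

Keeping the fast path free of sign handling inside the loop and free of overflow checks is what makes it cheap; everything unusual is pushed to the slower general path.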