Principle:Duckdb Duckdb JSON Processing

From Leeroopedia
Revision as of 18:14, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Duckdb_Duckdb_JSON_Processing.md)


Knowledge Sources
Domains: Data_Format, Serialization, Text_Processing
Last Updated: 2026-02-07 12:00 GMT

Overview

High-performance parsing and manipulation of JSON (JavaScript Object Notation) documents using an in-memory tree representation that supports both read-only and mutable access patterns.

Description

JSON (JavaScript Object Notation) is a lightweight, text-based data interchange format that has become the de facto standard for web APIs, configuration files, and semi-structured data storage. JSON processing encompasses parsing JSON text into an in-memory representation, querying and extracting values from that representation, and serializing the representation back to text.
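
The three stages named above (parse, query, serialize) can be sketched with Python's standard `json` module. This is only an illustration of the workflow: the sample document and variable names are made up for the example, and Python's parser is not the one DuckDB uses.

```python
import json

# Hypothetical sample document, used only for illustration.
text = '{"user": {"name": "Ada", "tags": ["x", "y"]}, "active": true}'

# 1. Parse: JSON text -> in-memory tree of dicts, lists, and scalars.
doc = json.loads(text)

# 2. Query: walk the tree by key and by array index.
name = doc["user"]["name"]
second_tag = doc["user"]["tags"][1]

# 3. Serialize: tree -> compact JSON text.
round_trip = json.dumps(doc, separators=(",", ":"))
```

Parsing `round_trip` again yields a tree equal to `doc`, which is the property a parse and serialize pair is expected to preserve.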

High-performance JSON parsing is non-trivial because a parser must handle variable-length strings with escape sequences, arbitrary nesting depth, and numbers that may be integers or carry fraction and exponent parts. Modern JSON parsers achieve high throughput through several techniques: SIMD-accelerated scanning for structural characters (braces, brackets, commas, colons), branchless number parsing, and memory pool allocation to reduce malloc overhead.

The in-memory representation typically uses either a DOM (Document Object Model) approach, where the entire document is parsed into a tree of typed nodes, or a SAX-style streaming approach (the name comes from the Simple API for XML) that emits an event for each element as it is encountered. The DOM approach provides random access to any part of the document at the cost of memory proportional to document size, while the streaming approach never materializes the whole document, so its memory use is bounded by nesting depth rather than document size. High-performance libraries like yyjson use an optimized DOM model with a contiguous memory layout and an immutable (read-only) document representation for cache efficiency, alongside a separate mutable document type for modification operations.

Usage

JSON processing is used extensively in DuckDB through its JSON extension and native JSON type support. DuckDB can parse JSON documents, extract fields using JSONPath-like syntax (`json_extract`, `->`, `->>`), convert between JSON and relational types, and read newline-delimited JSON (NDJSON) files as tables. The high-performance parser is intended to keep JSON ingestion fast enough for analytical workloads, although JSON text still has to be parsed and converted before it can be processed in columnar form.
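
To make "reading NDJSON as a table" and "extracting a field by path" concrete, here is a plain-Python sketch of the idea. It deliberately does not call DuckDB; `extract`, `ndjson_to_rows`, and the sample records are hypothetical names invented for this illustration, and the missing-field behavior (returning `None`) is an assumption chosen to resemble a SQL NULL.

```python
import json

# Hypothetical NDJSON input: one JSON document per line.
ndjson = (
    '{"id": 1, "user": {"name": "Ada"}}\n'
    '{"id": 2, "user": {"name": "Lin"}}\n'
    '{"id": 3}\n'
)

def extract(value, path):
    """Follow a list of keys/indexes; return None if any step is missing."""
    for step in path:
        if isinstance(value, dict) and step in value:
            value = value[step]
        elif isinstance(value, list) and isinstance(step, int) and 0 <= step < len(value):
            value = value[step]
        else:
            return None
    return value

def ndjson_to_rows(text, columns):
    """Turn each non-empty line into one row with one cell per column path."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows.append(tuple(extract(record, path) for path in columns.values()))
    return rows

# Column name -> path into each record.
columns = {"id": ["id"], "name": ["user", "name"]}
rows = ndjson_to_rows(ndjson, columns)
```

Each line becomes one row, and a record that lacks the requested path contributes an empty cell instead of failing the whole read.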

Theoretical Basis

JSON Grammar: The recursive structure of JSON:

// JSON value types (RFC 8259)
value   = object | array | string | number | "true" | "false" | "null"
object  = '{' [ member (',' member)* ] '}'
member  = string ':' value
array   = '[' [ value (',' value)* ] ']'
string  = '"' characters '"'
number  = [ '-' ] int [ frac ] [ exp ]
int     = '0' | digit1-9 digit*          // no leading zeros
frac    = '.' digit+
exp     = ('e'|'E') ['+' | '-'] digit+
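
The `number` rule translates almost directly into a regular expression. The sketch below is one way to write it in Python; `JSON_NUMBER` and `is_json_number` are names chosen for this example, and `[0-9]` is used instead of `\d` so that non-ASCII digits are rejected.

```python
import re

# number = [ '-' ] int [ frac ] [ exp ]
JSON_NUMBER = re.compile(
    r"-?"                  # optional minus sign (a leading '+' is not allowed)
    r"(?:0|[1-9][0-9]*)"   # int: a single zero, or a non-zero digit then more digits
    r"(?:\.[0-9]+)?"       # frac: a dot followed by at least one digit
    r"(?:[eE][+-]?[0-9]+)?"  # exp: e/E, optional sign, at least one digit
)

def is_json_number(token: str) -> bool:
    """Return True if the whole token matches the JSON number grammar."""
    return JSON_NUMBER.fullmatch(token) is not None
```

Under this rule `-12.5e3` and `1E+10` are accepted, while `01`, `1.`, `.5`, and `+1` are rejected, which is stricter than what many programming languages accept as numeric literals.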

DOM Tree Representation: In-memory document model:

// Node types in the document tree
Node = {
    type: object | array | string | number | bool | null
    value: union {
        object: list of (key, value) pairs
        array:  list of values
        string: UTF-8 byte sequence
        number: int64 or double
        bool:   true or false
    }
}

// Contiguous memory layout for cache efficiency:
// All nodes stored in a flat array, children referenced by index
// Strings stored in a separate contiguous buffer
// This avoids pointer-chasing and reduces allocator pressure
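
The index-based layout described in the comments above can be sketched as follows. This is an illustration of the idea only, not yyjson's actual layout: `Node`, `flatten`, `store_string`, and `read_string` are hypothetical names, and a Python list holds object references, so the sketch shows the indexing scheme but not true memory contiguity.

```python
import json
from dataclasses import dataclass

@dataclass
class Node:
    kind: str        # "object", "array", "string", "number", "bool", or "null"
    payload: object  # scalar, (offset, length) string ref, or child indexes

def store_string(text, strings):
    """Append UTF-8 bytes to the shared buffer; return an (offset, length) ref."""
    data = text.encode("utf-8")
    offset = len(strings)
    strings.extend(data)
    return (offset, len(data))

def read_string(ref, strings):
    offset, length = ref
    return bytes(strings[offset:offset + length]).decode("utf-8")

def flatten(value, nodes, strings):
    """Append value and its descendants to nodes; return the index of value."""
    index = len(nodes)
    nodes.append(None)  # reserve the slot so a parent precedes its children
    if isinstance(value, dict):
        members = [(store_string(k, strings), flatten(v, nodes, strings))
                   for k, v in value.items()]
        nodes[index] = Node("object", members)
    elif isinstance(value, list):
        nodes[index] = Node("array", [flatten(v, nodes, strings) for v in value])
    elif isinstance(value, str):
        nodes[index] = Node("string", store_string(value, strings))
    elif isinstance(value, bool):  # must be checked before numbers: bool is an int
        nodes[index] = Node("bool", value)
    elif value is None:
        nodes[index] = Node("null", None)
    else:
        nodes[index] = Node("number", value)
    return index

nodes, strings = [], bytearray()
root = flatten(json.loads('{"name": "Ada", "scores": [1, 2.5]}'), nodes, strings)
```

After this runs, every child is reachable from its parent by an integer index into `nodes`, and all key and value text lives in the single `strings` buffer.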

SIMD-Accelerated Structural Scanning:

// Find structural characters using SIMD (conceptual)
function find_structural_chars(input, length):
    structural_positions = []
    for each 16/32-byte chunk starting at offset base:
        // Compare against structural characters simultaneously
        eq_brace_open  = simd_cmpeq(chunk, '{')
        eq_brace_close = simd_cmpeq(chunk, '}')
        eq_bracket_open  = simd_cmpeq(chunk, '[')
        eq_bracket_close = simd_cmpeq(chunk, ']')
        eq_comma  = simd_cmpeq(chunk, ',')
        eq_colon  = simd_cmpeq(chunk, ':')
        eq_quote  = simd_cmpeq(chunk, '"')
        eq_backslash = simd_cmpeq(chunk, '\\')
        // Combine masks
        structural = eq_brace_open | eq_brace_close | ...
        // Handle string interiors (between quotes) separately;
        // in-string and escape state must carry over from the previous chunk
        in_string = compute_string_mask(eq_quote, eq_backslash)
        structural &= ~in_string
        // Record the absolute position of every remaining set bit
        for each set bit i in structural:
            append (base + i) to structural_positions
    return structural_positions
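
For comparison, here is a scalar reference version of the same scan in Python. It uses no SIMD at all; it computes character by character the result that the masked, chunked version is meant to produce, which makes it useful as a correctness oracle when testing a vectorized implementation. The name `find_structural_chars` is reused from the pseudocode, and the function body is a sketch rather than any library's implementation.

```python
STRUCTURAL = frozenset("{}[],:")

def find_structural_chars(text):
    """Return indexes of structural characters that lie outside string literals."""
    positions = []
    in_string = False  # True while between an opening and a closing quote
    escaped = False    # True when the previous in-string character was a backslash
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # this character is escaped, so it cannot close the string
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in STRUCTURAL:
            positions.append(i)
    return positions

# The comma inside "a,b" and the bracket inside "x\"]" are not structural.
sample = r'{"a,b": [1, "x\"]"]}'
positions = find_structural_chars(sample)
```

The `escaped` flag is the scalar counterpart of the backslash mask: a quote preceded by an odd number of backslashes does not end the string.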

Number Parsing: Fast path for integer and float conversion:

function parse_number(input):
    // Fast path: small integers
    if is_simple_integer(input):
        return fast_atoi(input)   // branchless digit accumulation

    // General path: floating point
    // 1. Parse sign, integer part, fraction, exponent
    // 2. Use fast algorithm (Eisel-Lemire or similar)
    //    to convert decimal to IEEE 754 double
    // 3. Fall back to exact conversion if needed
