Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Duckdb Duckdb Fast Number Parsing

From Leeroopedia


Knowledge Sources
Domains Parsing, Numeric_Computation
Last Updated 2026-02-07 12:00 GMT

Overview

High-performance algorithms for converting string representations of numbers to their binary floating-point or integer equivalents, significantly outperforming standard library functions like strtod.

Description

Number parsing -- converting strings like "3.14159" to their IEEE 754 floating-point representation -- is a surprisingly complex operation that appears in many data processing contexts. Standard library functions (strtod, strtof, atoi) are designed for correctness and locale support but are not optimized for throughput. In database systems that ingest CSV files or parse text-based formats, number parsing can become a significant bottleneck.

Fast number parsing algorithms achieve multi-gigabyte-per-second throughput through several techniques. The Eisel-Lemire algorithm provides a fast path for float parsing that handles the vast majority of real-world inputs in approximately 10-15 nanoseconds. It works by computing the exact binary representation using 128-bit arithmetic, avoiding the expensive arbitrary-precision arithmetic required by naive approaches. For the rare cases where the fast path cannot determine the correct rounding, it falls back to a slower exact algorithm.

The key insight is that most decimal floating-point numbers can be converted exactly using a single 64x64-bit multiplication followed by a power-of-10 lookup table. The algorithm pre-computes high-precision representations of powers of 10 and uses them to convert the decimal mantissa and exponent to a binary mantissa and exponent in constant time. Integer parsing uses similar optimizations: processing multiple digits simultaneously using SWAR (SIMD Within A Register) techniques and branchless digit counting.

Usage

Fast number parsing is used throughout DuckDB whenever text must be converted to numeric types. This includes CSV file reading (one of the most common data ingestion paths), CAST operations from VARCHAR to numeric types, JSON number parsing, and parsing of numeric literals in SQL statements. The performance of number parsing directly affects the speed of data loading and text processing operations.

Theoretical Basis

Eisel-Lemire Fast Path: Converting decimal to binary float:

// Input: decimal mantissa w, decimal exponent q
// Output: IEEE 754 double (sign * w * 10^q)

function fast_float_parse(w, q):
    // Step 1: Lookup pre-computed 128-bit power of 10
    (hi, lo) = power_of_ten_table[q + 348]  // 128-bit value

    // Step 2: Multiply mantissa by power of 10 (128-bit multiply)
    product_hi, product_lo = full_multiply_128(w, hi)

    // Step 3: Compute binary exponent
    binary_exp = floor(log2(10) * q) + 64 + clz(product_hi)

    // Step 4: Check if result is exact (no rounding ambiguity)
    if lower_bits_are_zero(product_lo):
        return construct_double(sign, binary_exp, product_hi >> shift)
    else:
        // Rare case: fall back to exact algorithm
        return exact_conversion(w, q, sign)

Fast Integer Parsing: Branchless multi-digit processing:

// Parse 8 digits at once using SWAR
function parse_8_digits(chars):
    // Load 8 ASCII bytes as a 64-bit integer
    val = load_u64(chars)

    // Subtract '0' from each byte
    val = val - 0x3030303030303030

    // Multiply and accumulate using SWAR
    // Combine pairs: (d0*10+d1, d2*10+d3, d4*10+d5, d6*10+d7)
    val = (val * 0x000F424000000064 + (val >> 8) * 0x0000271000000001)
    // Extract final result
    return (val >> 32) + (val & 0xFFFFFFFF) * 10000

Digit Counting: Determine string length without branches:

// Count digits in an integer using log10 approximation
function count_digits(n):
    // Uses the fact that floor(log10(n)) = floor(log2(n) * 77/256)
    if n == 0: return 1
    bits = 64 - leading_zeros(n)
    approx = (bits * 77) >> 8
    // Correct for off-by-one using comparison table
    return approx + (n >= power_of_10_table[approx] ? 1 : 0)

Correctness Guarantee:

// The fast path handles ~99% of inputs correctly
// For the remaining ~1%, fallback algorithms guarantee:
// 1. Correct rounding to nearest (ties to even)
// 2. Exact conversion for all valid IEEE 754 inputs
// 3. Proper handling of denormals, infinity, NaN
// Fallback uses extended precision (e.g., 192-bit arithmetic)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment