Principle:Duckdb Duckdb Fast Number Parsing
| Knowledge Sources | |
|---|---|
| Domains | Parsing, Numeric_Computation |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
High-performance algorithms for converting string representations of numbers to their binary floating-point or integer equivalents, significantly outperforming standard library functions like strtod.
Description
Number parsing -- converting strings like "3.14159" to their IEEE 754 floating-point representation -- is a surprisingly complex operation that appears in many data processing contexts. Standard library functions (strtod, strtof, atoi) are designed for correctness and locale support but are not optimized for throughput. In database systems that ingest CSV files or parse text-based formats, number parsing can become a significant bottleneck.
Fast number parsing algorithms achieve multi-gigabyte-per-second throughput through several techniques. The Eisel-Lemire algorithm provides a fast path for float parsing that handles the vast majority of real-world inputs in approximately 10-15 nanoseconds. It works by computing the exact binary representation using 128-bit arithmetic, avoiding the expensive arbitrary-precision arithmetic required by naive approaches. For the rare cases where the fast path cannot determine the correct rounding, it falls back to a slower exact algorithm.
The key insight is that most decimal floating-point numbers can be converted exactly using a single 64x64-bit multiplication followed by a power-of-10 lookup table. The algorithm pre-computes high-precision representations of powers of 10 and uses them to convert the decimal mantissa and exponent to a binary mantissa and exponent in constant time. Integer parsing uses similar optimizations: processing multiple digits simultaneously using SWAR (SIMD Within A Register) techniques and branchless digit counting.
Usage
Fast number parsing is used throughout DuckDB whenever text must be converted to numeric types. This includes CSV file reading (one of the most common data ingestion paths), CAST operations from VARCHAR to numeric types, JSON number parsing, and parsing of numeric literals in SQL statements. The performance of number parsing directly affects the speed of data loading and text processing operations.
Theoretical Basis
Eisel-Lemire Fast Path: Converting decimal to binary float:
// Input: decimal mantissa w, decimal exponent q
// Output: IEEE 754 double (sign * w * 10^q)
function fast_float_parse(w, q):
// Step 1: Lookup pre-computed 128-bit power of 10
(hi, lo) = power_of_ten_table[q + 348] // 128-bit value
// Step 2: Multiply mantissa by power of 10 (128-bit multiply)
product_hi, product_lo = full_multiply_128(w, hi)
// Step 3: Compute binary exponent
binary_exp = floor(log2(10) * q) + 64 + clz(product_hi)
// Step 4: Check if result is exact (no rounding ambiguity)
if lower_bits_are_zero(product_lo):
return construct_double(sign, binary_exp, product_hi >> shift)
else:
// Rare case: fall back to exact algorithm
return exact_conversion(w, q, sign)
Fast Integer Parsing: Branchless multi-digit processing:
// Parse 8 digits at once using SWAR
function parse_8_digits(chars):
// Load 8 ASCII bytes as a 64-bit integer
val = load_u64(chars)
// Subtract '0' from each byte
val = val - 0x3030303030303030
// Multiply and accumulate using SWAR
// Combine pairs: (d0*10+d1, d2*10+d3, d4*10+d5, d6*10+d7)
val = (val * 0x000F424000000064 + (val >> 8) * 0x0000271000000001)
// Extract final result
return (val >> 32) + (val & 0xFFFFFFFF) * 10000
Digit Counting: Determine string length without branches:
// Count digits in an integer using log10 approximation
function count_digits(n):
// Uses the fact that floor(log10(n)) = floor(log2(n) * 77/256)
if n == 0: return 1
bits = 64 - leading_zeros(n)
approx = (bits * 77) >> 8
// Correct for off-by-one using comparison table
return approx + (n >= power_of_10_table[approx] ? 1 : 0)
Correctness Guarantee:
// The fast path handles ~99% of inputs correctly
// For the remaining ~1%, fallback algorithms guarantee:
// 1. Correct rounding to nearest (ties to even)
// 2. Exact conversion for all valid IEEE 754 inputs
// 3. Proper handling of denormals, infinity, NaN
// Fallback uses extended precision (e.g., 192-bit arithmetic)