Principle:ClickHouse ClickHouse SIMD Memory Comparison

Knowledge Sources	ClickHouse
Domains	Memory, SIMD
Last Updated	2026-02-08 00:00 GMT

Overview

Accelerating memory comparison by processing 16 or more bytes simultaneously using vector instructions, combined with allowing safe overflow reads to eliminate edge case branches.

Description

SIMD memory comparison leverages vector processing units to compare multiple bytes in parallel. Instead of comparing byte-by-byte, the algorithm loads 16-byte chunks into SIMD registers and performs parallel equality comparisons. A key innovation is the "allow overflow" approach: by guaranteeing that memory is readable for 15 extra bytes beyond the valid region, the implementation can use aligned SIMD operations without complex edge case handling. This eliminates branches and simplifies the code while maintaining safety through allocator-level guarantees.

The principle combines two optimizations: parallel processing via SIMD and simplified control flow via overflow tolerance. Modern CPUs can compare 16 bytes in a single instruction that would otherwise require 16 comparisons and up to 16 branches. The bit mask output from SIMD comparison instructions directly indicates which bytes differ, enabling fast early-exit using count-trailing-zeros instructions.

Usage

Use SIMD memory comparison for comparing short strings (under 256 bytes), fixed-width database columns, hash values, UUIDs, and other small fixed-size data. This is the right choice when comparison is in a hot path (sorting, hash table lookups, deduplication). Choose standard `memcmp` for large regions where the algorithm difference matters less, or when the overflow guarantee cannot be provided.

Theoretical Basis

Performance Analysis:

Byte-by-byte comparison:

1 load + 1 compare + 1 branch per byte
For 16 bytes: ~48 cycles (with branch mispredictions)
Early exit helps but unpredictable

SIMD comparison (SSE2/NEON):

2 loads (16 bytes each) = ~2 cycles
1 vector compare = ~1 cycle
1 movemask + 1 test = ~2 cycles
Total: ~5 cycles for 16 bytes
Speedup: ~10x for equal strings, ~5x on average

SIMD comparison (AVX-512):

Same as SSE2 but with dedicated comparison mask instructions
Slightly faster: ~4 cycles for 16 bytes
Can process 64 bytes at once: ~8 cycles
Speedup: ~15x for longer strings

The Overflow Trick:

Traditional approach requires handling partial chunks: ``` while (size >= 16) {

   // Compare 16 bytes
   size -= 16;

} // Handle remaining 0-15 bytes with special code ```

With overflow allowance: ``` while (size >= 16) {

   // Compare 16 bytes
   size -= 16;

} // Compare final 16 bytes with overlap (may re-compare some bytes) // No special case needed! ```

The final comparison reads beyond the valid region but the allocator guarantees this memory is readable. This eliminates ~10-20 cycles of branch logic and special case handling.

Why It Works:

Memory allocators typically provide padding:

`malloc` allocates in chunks (often 16-byte aligned)
String allocators often round up to power of 2
Stack arrays often have padding
Most practical strings have exploitable padding

The implementation uses Memory Sanitizer annotations to mark overflow regions as valid, ensuring tools don't report false positives.

Optimization for Special Cases:

16-byte comparison (UUID, hash):

Single SIMD instruction
No loop overhead
3-4 cycles total
10-15x faster than byte-by-byte

Zero-checking:

Compare against zero vector
Same SIMD parallelism
Useful for detecting padding, cleared memory

Platform Variations:

x86 SSE2:

`_mm_cmpeq_epi8` + `_mm_movemask_epi8`
Available on all modern x86-64 CPUs
Produces 16-bit mask

x86 AVX-512:

`_mm_cmp_epi8_mask`
Direct mask comparison
More efficient instruction encoding

ARM NEON:

`vceqq_u8` + nibble mask extraction
Similar performance to SSE2
Different instruction encoding

Cache Efficiency:

Sequential memory access patterns play well with cache prefetchers. Reading 16 bytes at a time aligns with cache line boundaries (64 bytes), improving throughput. The SIMD registers don't use cache capacity, further improving efficiency.

Related Pages

Implementation:ClickHouse_ClickHouse_MemcmpSmall

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment