Principle:ClickHouse ClickHouse SIMD Memory Comparison
| Knowledge Sources | |
|---|---|
| Domains | Memory, SIMD |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Accelerating memory comparison by processing 16 or more bytes simultaneously using vector instructions, combined with allowing safe overflow reads to eliminate edge case branches.
Description
SIMD memory comparison leverages vector processing units to compare multiple bytes in parallel. Instead of comparing byte-by-byte, the algorithm loads 16-byte chunks into SIMD registers and performs parallel equality comparisons. A key innovation is the "allow overflow" approach: by guaranteeing that memory is readable for 15 extra bytes beyond the valid region, the implementation can use aligned SIMD operations without complex edge case handling. This eliminates branches and simplifies the code while maintaining safety through allocator-level guarantees.
The principle combines two optimizations: parallel processing via SIMD and simplified control flow via overflow tolerance. Modern CPUs can compare 16 bytes in a single instruction that would otherwise require 16 comparisons and up to 16 branches. The bit mask output from SIMD comparison instructions directly indicates which bytes differ, enabling fast early-exit using count-trailing-zeros instructions.
Usage
Use SIMD memory comparison for comparing short strings (under 256 bytes), fixed-width database columns, hash values, UUIDs, and other small fixed-size data. This is the right choice when comparison is in a hot path (sorting, hash table lookups, deduplication). Choose standard `memcmp` for large regions where the algorithm difference matters less, or when the overflow guarantee cannot be provided.
Theoretical Basis
Performance Analysis:
Byte-by-byte comparison:
- 1 load + 1 compare + 1 branch per byte
- For 16 bytes: ~48 cycles (with branch mispredictions)
- Early exit helps but unpredictable
SIMD comparison (SSE2/NEON):
- 2 loads (16 bytes each) = ~2 cycles
- 1 vector compare = ~1 cycle
- 1 movemask + 1 test = ~2 cycles
- Total: ~5 cycles for 16 bytes
- Speedup: ~10x for equal strings, ~5x on average
SIMD comparison (AVX-512):
- Same as SSE2 but with dedicated comparison mask instructions
- Slightly faster: ~4 cycles for 16 bytes
- Can process 64 bytes at once: ~8 cycles
- Speedup: ~15x for longer strings
The Overflow Trick:
Traditional approach requires handling partial chunks: ``` while (size >= 16) {
// Compare 16 bytes size -= 16;
} // Handle remaining 0-15 bytes with special code ```
With overflow allowance: ``` while (size >= 16) {
// Compare 16 bytes size -= 16;
} // Compare final 16 bytes with overlap (may re-compare some bytes) // No special case needed! ```
The final comparison reads beyond the valid region but the allocator guarantees this memory is readable. This eliminates ~10-20 cycles of branch logic and special case handling.
Why It Works:
Memory allocators typically provide padding:
- `malloc` allocates in chunks (often 16-byte aligned)
- String allocators often round up to power of 2
- Stack arrays often have padding
- Most practical strings have exploitable padding
The implementation uses Memory Sanitizer annotations to mark overflow regions as valid, ensuring tools don't report false positives.
Optimization for Special Cases:
16-byte comparison (UUID, hash):
- Single SIMD instruction
- No loop overhead
- 3-4 cycles total
- 10-15x faster than byte-by-byte
Zero-checking:
- Compare against zero vector
- Same SIMD parallelism
- Useful for detecting padding, cleared memory
Platform Variations:
x86 SSE2:
- `_mm_cmpeq_epi8` + `_mm_movemask_epi8`
- Available on all modern x86-64 CPUs
- Produces 16-bit mask
x86 AVX-512:
- `_mm_cmp_epi8_mask`
- Direct mask comparison
- More efficient instruction encoding
ARM NEON:
- `vceqq_u8` + nibble mask extraction
- Similar performance to SSE2
- Different instruction encoding
Cache Efficiency:
Sequential memory access patterns play well with cache prefetchers. Reading 16 bytes at a time aligns with cache line boundaries (64 bytes), improving throughput. The SIMD registers don't use cache capacity, further improving efficiency.